Deep Autoregressive Network (ICML’14)

This is one of the early paper on generative modeling. This work was on arXiv since Oct 2013 before the reparameterization trick has been popularized [2, 3]. It is interesting to look back and see the challenge of training the model with stochastic layers.

Model Architecture

Deep Autoregressive Network (DARN) is a flexible deep generative model with the following architecture features: First, its stochastic layer is stackable. This improves representational power. Second, the deterministic layers can be inserted between the stochastic layers to add complexity to the model.  Third, the generative model such as NADE and EoNADE can be used instead of the simple linear autoregressive. This also improves the representational power.



The main difference from VAE [2] is that the hidden units are binary vectors (which is similar to the restricted Boltzmann machine). VAE requires a continuous vector as hidden units unless we approximate the discrete units with Gumbel-Softmax.

DARN does not assume any form of distribution on its prior p(h) or conditional distribution p(x|h) and q(h|x). The vanilla VAE assumes a standard Gaussian distribution with the diagonal covariance matrix. This could be either good or bad thing for DARN.


Since DARN is an autoregressive model, it needs to sample one value at a time, from top hidden layer all the way down to the observed layer.

Minimum Description Length

This is my favorite section of this paper. There is a strong connection between the information theory and variational lowerbound. In EM algorithm, we use Jensen’s inequality to derive the smooth function that acts as a lowerbound of the log likelihood. Some scholars refer this lowerbound as an Evidence Lowerbound (ELBO).

This lowerbound can be derived from information theory perspective. From the Shannon’s theory, the description length is:

L(x|h) = -\log_2 p(x|h)

If h is a compressed version of x, then we need to transport h along with the residual in order to reconstruct the original message x.

The main idea is simple. The less predictable event requires more bits to encode. The shorter bits is better because we will transport fewer bits over the wire. Hence, we want to minimize the description length of the following message x:

L(x) = \sum_h q(h|x)(L(h) + L(x|h))

The q(h|x) is an encode probability of h. The description length of the representation h is defined as:

L(h) = -log_2 p(h) + \log_2 q(h|x)

Finally, the entire description length or Helmholtz variational free energy is:

L(x) = -sum_h q(h|x)(\log_2 p(x,h) - \log_2 q(h|x)) (1)

This is formula is exactly the same as the ELBO when h is a discrete value.


The variational free energy formula (1) is intractable because it requires summation over all h. DARN employs a sampling approach to learn the parameter.

The expectation term is approximated by sampling h \sim q(H|x), then now we can compute the gradient of (1). However, this approach has a high variance. The authors use a few tricks to keep variance low. (Check out the apprendix).


DARN is one of the early paper that use the stochastic layers as part of its model. Optimization through these layers posed a few challenges such as high variances from the Monte Carlo approximation.


[1] Gregor, Karol, et al. “Deep autoregressive networks.” arXiv preprint arXiv:1310.8499 (2013).

[2] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[3] Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint arXiv:1401.4082 (2014).


Embedding-based Query Language Models (ICTIR’16)

This paper proposes a more accurate query language model for query expansion task based on word embedding.

The key contribution is their observation that the cosine similarity score between the top words with the query word is not too distant from the similarity score between the 1000th word with the same query. They use a sigmoid function to amplify the similarity score.

The rest of the paper proposes two query language models.

The first model

The first model assumes the conditional independence between the query terms.

The model is:

p(w|\theta_q) \propto p(\theta_q|w)p(w) = p(w)\prod_{i=1}^kp(q_i|w)

It defines: p(q_i|w) = \frac{\delta(q_i,w)}{\sum_{w'\in V} \delta(w',w)}

And p(w) = \sum_{w' \in V} p(w, w') \propto \sum_{w' \in V} \delta(w, w')

The second model

p(w|\theta_q) = \sum_{w'\in V} p(w,w'|\theta_q)p(w'|\theta_q)

This model now assumes that query and term similarity is independent:

$latex p(w,w’|\theta_q) = p(w|w’) = \frac{\delta(w,w’)}{\sum_{w”\in V} \delta(w”, w’)}$

and the second term uses MLE to estimate the probability:

p(w'|\theta_q) = \frac{C(w'|Q)}{|Q|}

The paper also talked about the embedded-based relevance model. The existing model by the same author is now extent with the proposed embedded query language model.


[1] Zamani, Hamed, and W. Bruce Croft. “Embedding-based query language models.” Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval.

Pseudo-Relevance Feedback based on Matrix Factorization (CIKM’16)

This work uses pseudo-relevance feedback collections to recommend top words and their weights to the input query.

It starts with creating a document-word matrix where each row each retrieved documents and each column is term weight. The first row is the input query’s term weights. The rest of documents either come from relevant feedback or top-K retrieved documents (pseudo-relevant feedback).

Then, we perform a non-negative matrix factorization (NMF) on the matrix. The first row of the reconstructed matrix is a re-estimate query and its top m terms will be used query expansion terms.


However, this model needs to re-compute NMF for every query, this can be expensive operations. Thus, the paper recommends creating a small document-word matrix (e.g use only top 10 retrieved documents).

What makes this model stands out is that the model not only considers the words that discriminate the top documents from the collection, it only considers the relevance of those words to the original query.

I am wondered if calculate the NMF on such a small matrix will really work. The empirical results demonstrate that this idea works but its performance is not impressive. But I think using NMF for a query expansion task is a neat idea.


Generating images from captions with attention (ICLR’16)


This work extends the original DRAW paper [2] to generate images given captions. We can treat this model as conditional DRAW. That is we model a conditional probability of P(\text{image}|\text{caption}) The additional textual data controls where to read and write the image.


Generating images from text descriptions is a structure prediction task. That is given a sequence of words, we want to generate an image. Although AlignDRAW has borrowed the same approach as DRAW by combining progressive refinement with attention, incorporating text sequence is their contribution.

The latent variable in DRAW model is sampled from spherical Gaussians, z_t \sim \mathcal{N}(Z_t|\mu_t, \sigma_t) where its mean and variance are functions of the current hidden state of the encoder, e.g. \mu_t = W(h_t^{\text{enc}}). However, AlignDRAW adds dependency between latent variables: z_t \sim \mathcal{N}(\mu(h_{t-1})^{\text{gen}}, \sigma(h_{t-1}^{\text{gen}}).

During the image generation, DRAW iteratively samples a latent variable z_t from a prior \mathcal{N}(0, I), but AlignDRAW will draw z_t from P(Z_t|Z_{<t}). It means that there is a dependency between each latent vector in AlignDRAW model.


Align Operator

The input caption is fed to the BI-Directional RNN. Each output from each time-step needs to be aligned with the current drawing patch. Attention weight is then learned from caption representation up to k words and the current hidden state of the decoder h_{t-1}^{\text{gen}}. Finally, compute the weight average of all hidden state of the language model to obtain the caption context, s_t. This context together with a latent vector z_t will be fed to the LSTM decoder.

Objective Function

This model maximizes the expectation of the variational lowerbound. There are 2 terms: the data likelihood and KL loss.

Closing Thoughts

AlignDRAW uses bi-directional LSTM with attention to aligning each word context with the patches in the image. Some generated images from caption are interesting such as ‘A herd of elephants walking across a dry grass field’. The model generalizes the training data and able to generate novel images.


[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).

DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation by progressively improve the reconstructed image.

The previous image generation models generate the entire image by learning a sampling function (GANs), distribution over a latent vector (VAE), or generate one pixel at a time (PixelRNN, PixelCNN). Although the generated images from these models are in a good quality, these models are forced to learn a complicated and high-dimensional distribution. For example, to generate a car image, the models need to approximate the distribution of all possible cars. This is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks down the complex distribution into a chain of conditional distribution:

P(X, C_T) = P(X|C_T)P(C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)

Therefore, estimating a conditional distribution is much easier. The conditional probability is modeled by the standard LSTM.

Latent Variable

Use VAE framework helps us project the input image which has a high dimension into a low-dimensional space. Working on the smaller latent space is much easier than the original image space.

Attention Mechanism

The progressive refinement through LSTM has simplified the complex distribution through time, then the attention mechanism simplifies the spatial data into a smaller patch. The encoder and decoder now only needs to deal with a small fraction of the image instead of the image as a whole. This idea again reduces the input space by focusing on the important part of the image only.

Read and Write Operations

This part can be intimated to read at the first glance due to the use of the Gaussian filters. There are many nice blogs that described Read and Write operations with attention mechanism in detail. The main idea is the Read operation crops the input image. The Write operation draws a patch to the canvas matrix.


This is a must-read paper. The combine of progress refinement through time with attention mechanism is a nice idea to simplify the complex image distribution. This is one of the early paper that combine RNN with attention to handle the spatial data such image. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


DocNADE (JMLR 2016)

This work extends Neural Autoregressive Distribution Estimation (NADE) for a document modeling.


The key idea of NADE is each hidden and output vectors are modeled as a conditional probability of previously seen vectors:

p(\textbf{v}) = \prod_{i=1}^D p(v_i | \textbf{v}_{<i})

\textbf{h}( \textbf{v}_{<i}) = g(c + W_{:,<i}\textbf{v}_{<i})

Then, the probability of the output is:

p(v_i=1|\textbf{v}_{<i}) = \sigma(b_i + V_{i,:}\textbf{h}_i(\textbf{v}_{<i}))


NADE has a set of separated hidden layers, each represents the previously seen context. However, NADE is not applicable for a variable length input such as a sequence of words.


DocNADE model tackles a variable length input issue by computing the hidden vector as follows:

\textbf{h}( \textbf{v}_{<i}) = g(c + \sum_{k<i} W_{:,v_k})

Each word v_k is an index in the vocabulary of fixed length. Each column of matrix W is a word embedding. Hence, a summation of word vectors represents a previous word context. This does not preserve the word order since the model simply sums all word vectors.


The output layer requires a softmax function to compute the word probability. A hierarchy softmax is necessary to scale up this calculation.

DocNADE Language Model

The previous model may not suitable for language model because it focuses on learning a semantic representation of the document. The hidden layer now needs to pay more attention to the previous terms. It can be accomplished by using n-gram model:

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{b} + \textbf{h}_i^{DN}(\textbf{v}_{<i}) + \textbf{h}_i^{LM}(\text{v}_{<i}))

The additional hidden unit $latex \textbf{h}_i^{LM}$ models a n-gram language model:

$latex \textbf{h}_i^{LM}(\text{v}_{<i}) = \sum_{k=1}^{n-1}U_k \dot W_{:,v_{i-k}}^{LM}$

The matrix W^{LM} is a word embedding based on n-gram model.



DocNADE is similar to Recurrent Neural Network model where both models estimate the conditional probability of the current input given the previous input. For language modeling task, RNN is less explicit on how much word or context to look back. But DocNADE requires us to explicitly tell the model the number of words to look back. On the other hand, DocNADE has a similar favor to Word2Vec where the document representation is simply an aggregate of all previously seen words. However, DocNADE adds additional transformation on top of hidden units.

Will this type of Autoregressive model fall out of fashion due to the success of Recurrent Network with Attention mechanism and memory model? The current trend suggests that RNN is more flexible and extensible than NADE. Hence, there will be more development and extension of RNN models more and more in the coming year.


Lauly, Stanislas, et al. “Document neural autoregressive distribution estimation.” arXiv preprint arXiv:1603.05962 (2016).


Towards a Neural Statistician (ICLR2017)

One extension of VAE is to add a hierarchy structure. In the classical VAE, the prior is drawn from a standard Gaussian distribution. We can learn this prior from the dataset so that each dataset has its own prior distribution.

The generative process is:

  • Draw a dataset prior \mathbf{c} \sim N(\mathbf{0}, \mathbf{I})
  • For each data point in the dataset
    • Draw a latent vector \mathbf{z} \sim P(\cdot | \mathbf{c})
    • Draw a sample \mathbf{x} \sim P(\cdot | \mathbf{z})

The likelihood of the dataset is:

p(D) = \int p(c) \big[ \prod_{x \in D} \int p(x|z;\theta)p(z|c;\theta)dz \big]dc

The paper define the approximate inference network, q(z|x,c;\phi) and q(c|D; \phi) to optimize a variational lowerbound. The single dataset log likelihood lowerboud is:

\mathcal{L}_D = E_{q(c|D;\phi)}\big[  \sum_{x \in d} E_{q(z|c, x; \phi)}[ \log p(x|z;\theta)] - D_{KL}(q(z|c,x;\phi)||p(z|c;\theta)) \big] - D_{KL}(q(c|D;\phi)||p(c))

The interesting contribution of this paper is their statistic network q(c|D; \phi) that approximates the posterior distribution over the context c given the dataset D. Basically, this inference network has an encoder to take each datapoint into a vector e_i = E(x_i). Then, add a pool layer to aggregate e_i into a single vector. This paper uses an element-wise mean. Finally, the final vector is used to generate parameters of a diagonal Gaussian.


This model surprisingly works well for many tasks such as topic models, transfer learning, one-shot learning, etc.

Reference: (Poster ICLR 2017)