Deep Autoregressive Network (ICML’14)

This is one of the early paper on generative modeling. This work was on arXiv since Oct 2013 before the reparameterization trick has been popularized [2, 3]. It is interesting to look back and see the challenge of training the model with stochastic layers.

Model Architecture

Deep Autoregressive Network (DARN) is a flexible deep generative model with the following architecture features: First, its stochastic layer is stackable. This improves representational power. Second, the deterministic layers can be inserted between the stochastic layers to add complexity to the model.  Third, the generative model such as NADE and EoNADE can be used instead of the simple linear autoregressive. This also improves the representational power.



The main difference from VAE [2] is that the hidden units are binary vectors (which is similar to the restricted Boltzmann machine). VAE requires a continuous vector as hidden units unless we approximate the discrete units with Gumbel-Softmax.

DARN does not assume any form of distribution on its prior p(h) or conditional distribution p(x|h) and q(h|x). The vanilla VAE assumes a standard Gaussian distribution with the diagonal covariance matrix. This could be either good or bad thing for DARN.


Since DARN is an autoregressive model, it needs to sample one value at a time, from top hidden layer all the way down to the observed layer.

Minimum Description Length

This is my favorite section of this paper. There is a strong connection between the information theory and variational lowerbound. In EM algorithm, we use Jensen’s inequality to derive the smooth function that acts as a lowerbound of the log likelihood. Some scholars refer this lowerbound as an Evidence Lowerbound (ELBO).

This lowerbound can be derived from information theory perspective. From the Shannon’s theory, the description length is:

L(x|h) = -\log_2 p(x|h)

If h is a compressed version of x, then we need to transport h along with the residual in order to reconstruct the original message x.

The main idea is simple. The less predictable event requires more bits to encode. The shorter bits is better because we will transport fewer bits over the wire. Hence, we want to minimize the description length of the following message x:

L(x) = \sum_h q(h|x)(L(h) + L(x|h))

The q(h|x) is an encode probability of h. The description length of the representation h is defined as:

L(h) = -log_2 p(h) + \log_2 q(h|x)

Finally, the entire description length or Helmholtz variational free energy is:

L(x) = -sum_h q(h|x)(\log_2 p(x,h) - \log_2 q(h|x)) (1)

This is formula is exactly the same as the ELBO when h is a discrete value.


The variational free energy formula (1) is intractable because it requires summation over all h. DARN employs a sampling approach to learn the parameter.

The expectation term is approximated by sampling h \sim q(H|x), then now we can compute the gradient of (1). However, this approach has a high variance. The authors use a few tricks to keep variance low. (Check out the apprendix).


DARN is one of the early paper that use the stochastic layers as part of its model. Optimization through these layers posed a few challenges such as high variances from the Monte Carlo approximation.


[1] Gregor, Karol, et al. “Deep autoregressive networks.” arXiv preprint arXiv:1310.8499 (2013).

[2] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[3] Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint arXiv:1401.4082 (2014).


DocNADE (JMLR 2016)

This work extends Neural Autoregressive Distribution Estimation (NADE) for a document modeling.


The key idea of NADE is each hidden and output vectors are modeled as a conditional probability of previously seen vectors:

p(\textbf{v}) = \prod_{i=1}^D p(v_i | \textbf{v}_{<i})

\textbf{h}( \textbf{v}_{<i}) = g(c + W_{:,<i}\textbf{v}_{<i})

Then, the probability of the output is:

p(v_i=1|\textbf{v}_{<i}) = \sigma(b_i + V_{i,:}\textbf{h}_i(\textbf{v}_{<i}))


NADE has a set of separated hidden layers, each represents the previously seen context. However, NADE is not applicable for a variable length input such as a sequence of words.


DocNADE model tackles a variable length input issue by computing the hidden vector as follows:

\textbf{h}( \textbf{v}_{<i}) = g(c + \sum_{k<i} W_{:,v_k})

Each word v_k is an index in the vocabulary of fixed length. Each column of matrix W is a word embedding. Hence, a summation of word vectors represents a previous word context. This does not preserve the word order since the model simply sums all word vectors.


The output layer requires a softmax function to compute the word probability. A hierarchy softmax is necessary to scale up this calculation.

DocNADE Language Model

The previous model may not suitable for language model because it focuses on learning a semantic representation of the document. The hidden layer now needs to pay more attention to the previous terms. It can be accomplished by using n-gram model:

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{b} + \textbf{h}_i^{DN}(\textbf{v}_{<i}) + \textbf{h}_i^{LM}(\text{v}_{<i}))

The additional hidden unit $latex \textbf{h}_i^{LM}$ models a n-gram language model:

$latex \textbf{h}_i^{LM}(\text{v}_{<i}) = \sum_{k=1}^{n-1}U_k \dot W_{:,v_{i-k}}^{LM}$

The matrix W^{LM} is a word embedding based on n-gram model.



DocNADE is similar to Recurrent Neural Network model where both models estimate the conditional probability of the current input given the previous input. For language modeling task, RNN is less explicit on how much word or context to look back. But DocNADE requires us to explicitly tell the model the number of words to look back. On the other hand, DocNADE has a similar favor to Word2Vec where the document representation is simply an aggregate of all previously seen words. However, DocNADE adds additional transformation on top of hidden units.

Will this type of Autoregressive model fall out of fashion due to the success of Recurrent Network with Attention mechanism and memory model? The current trend suggests that RNN is more flexible and extensible than NADE. Hence, there will be more development and extension of RNN models more and more in the coming year.


Lauly, Stanislas, et al. “Document neural autoregressive distribution estimation.” arXiv preprint arXiv:1603.05962 (2016).


Sequence Autoencoder

Back in 2010, RNN is a good architecture for language models [3] due to its ability to remember the previous context. We will explore a few RNN architecture for learning document representation in this post.

Semi-supervised Sequence Learning [2] (NIPS 2014)

This model uses two RNN, the first one as an encoder, and later as a decoder.


Instead of learning to generate the output like in seq2seq model [1], this model learns to reconstruct the input. Hence, this model is a sequence autoencoder. LSTM is used in this paper. This unsupervised learning model is used for pretraining LSTM for different tasks such as sentiment analysis, text classification, and object classification.

A Hierarchical Neural Autoencoder for Paragraphs and Documents [4] (ACL 2015)

This model introduces a hierarchical LSTM to learn a document structure. The architecture has both encoder and decoder. The encoder processes a sequence of word tokens for each sentence. The final output from LSTM represents the input sentence. Then, the second LSTM layer will take a sequence of sentence vectors and output a document vector.

The decoder works in a backward fashion. It takes a document vector and feeds it to the LSTM to decode a sentence vector. Each sentence vector is then fed to another LSTM to decode each word in the sentence.

The author also introduces attention mechanism to put emphasis on particular sentences. The attention boosts the performance of the hierarchical model.


Generating Sentences from a Continuous Space [5]

This model combines RNNLM [3] with Variational autoencoder. The architecture is again composed of an encoder and decoder and attempts to reconstruct the given input. The additional stochastic layer converts an output from an encoder to mean and variance of the target Gaussian distribution. The document representation is sampled from this distribution. The decoder takes this representation and reconstructs word by word through another LSTM.


Training VAE under this architecture poses a challenge due to the component collapsing. The authors use KL annealing method by incrementally increases the weight of the KL loss over time. This modification helps the model to learn a much better representation.


There are a few more RNN architectures that learn document/sentence representation. The goal of learning the representation can be varied. If the goal is to generate a realistic text or dialogue then it is critical to retain syntactic accuracy as well as semantic information. However, if our goal is to obtain a global view of the given document, then we may bypass syntactic details but focus more on semantic meaning. These 3 models show how RNN architectures can be used to model for such tasks.


[1] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.

[2] Dai, Andrew M., and Quoc V. Le. “Semi-supervised sequence learning.” Advances in Neural Information Processing Systems. 2015.

[3] Mikolov, Tomas, et al. “Recurrent neural network based language model.” Interspeech. Vol. 2. 2010.

[4] Li, Jiwei, Minh-Thang Luong, and Dan Jurafsky. “A hierarchical neural autoencoder for paragraphs and documents.” arXiv preprint arXiv:1506.01057 (2015).

[5] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).



The autoregressive model is applicable to generate a realistic audio signal. Given the waveform \textbf{x} = {x_1, x_2, \cdots, x_T}, the joint probability of a waveform is

p(\textbf{x}) = \prod_{t=1}^T p(x_t | x_1, \cdots, x_{t-1})

Recurrent neural nets is a perfect model to learn and predict a one-dimensional sequence. However, since an audio signal has a lot of samples per second (44.1 kHz), using RNN to process one sample at a time is too slow.

This paper uses casual convolutional neural nets by predicting: p(x_{t+1}|x_1, \cdots, x_t). Masking the convolutional kernel to avoid the current output to see the future input. A convolutional NN architecture is more scalable than RNNs because we can process multiple inputs in parallel. The main limitation is that the convolutional kernel has to be very large to cover a longer range dependency.

Dilated convolution architecture uses a convolution kernel with holes in order to cover a large area of input signals. The kernel is dilated by filling zeros to expand the kernel while skipping some inputs in between.

Stacking convolutional NNs are an effective method to increase the depth of the networks. Residual and skip connections are utilized to speed up convergence of the model.

Wavenet can model the conditional probability given an external input such as a speaker identity, output text or phonic, p(\textbf{x}|\textbf{h}) = \prod_{t=1}^T p(x_t|x_1,\cdots,x_{t-1},\textbf{h}). There are two ways to construct an activation function that depends on textbf{h}, an extra input/context.

Global conditioning: all the output depends on the given context \textbf{h} directly.

Local conditioning: the input context is broken down into a timeseries h_t. Upsampling this signal to match the length of the input audio sample.

Wavenet generates realistic human voices both in English and mandarin.  It can also generate a piece of music audio but it sounds like a mad pianist storming on the piano.

You can check some generated output from DeepMind website:


Pixel Recurrent Neural Networks (ICML’16)

This paper proposes a novel autoregressive model to generate an image pixel by pixel. This idea makes sense since each pixel is correlated to its neighbor pixels. The same object has similar color and gradient. This idea has been exploited in an image compression such as jpeg where the color frequency does not change too frequent.

With this intuition, an autoregressive model assigns a probability of the observed sequence \vec x as p(\vec x) = \prod_{i=1}^{n^2} p(x_i | x_{<i}) For an image, each pixel is conditioned on the previous seen pixels.

A recurrent neural network such as LSTM has been an effective architecture for a sequence learning task. Hence, LSTM can be applied for an image generation task as well. There are two main challenges with using an autoregressive model for an image: first, an image is a 2-dimensional object – squeezing it to a 1-d vector will lose spatial information; second, training pixel by pixel is time-consuming and can’t be parallelized because we have to process the previous pixel before the current pixel. This paper attempts to solve these problems.

Model a color image 

Each pixel x_i has 3 values: an intensity in the red, green, and blue channels. The conditional distribution is modified as:

p(x_i|\vec x_{<i}) = p(x_{i,R}|\vec x_{<i})p(x_{i,G}|\vec x_{<i},x_{i,R})p(x_{i,B}|\vec x_{<i},x_{i,R},x_{i,G})

Each color is conditioned on the other channels as well as the previous pixel values.

Pixels as Discrete Value

This is a neat idea. Using a discrete distribution provides more flexibility because we do not assume the shape of the distribution.

Now we will dive into the proposed architectures:


When we apply LSTM on 2-dimensional input, we can first compute all the hidden states given the current pixel values and previous states. In this architecture, the previous states are the hidden states from the above rows. We can define the context window to control the amount of information to capture from the above row. If we set the context window to 3 (meaning taking the top, top-right, and top-left pixels’ hidden states):

h_{r, c} = f( h_{r-1, c-1}, h_{r-1, c}, h_{r-1, c+1}, x_{r, c})

We can parallelize this computation row by row.

Diagonal BiLSTM

One limitation of ROW LSTM is its lack of full context from the above row. By using bi-directional LSTM, this architecture can capture forward and backward dependency. Furthermore, each hidden state depends on the top-left and left hidden states:

h_{r,c} = f(h_{r-1,c-1}, h_{r,c-1}, x_{r,c})

However, the input image needs to be screwed by shifting each row by one pixel to the right, in order to parallelize this operation, column by column.

Pixel CNN

The sequential model is slow because it needs to process the previous states. Using convolutional layers to capture the spatial information is possible. But directly applying a convolutional kernel violates the autoregressive model’s assumption because the current pixel must not see the next pixel. Otherwise, we can’t generate the pixel because we will not know any incoming pixel. Hence, the mask is used to prevent the current pixel from seeing the future pixels. This trick is pretty neat.


Diagonal BiLSTM has the best performance out of 3 architectures in terms of nats (negative log-likelihood) scores. The autoregressive model does not assume the low-dimensional manifold assumption and it models a conditional distribution directly. However, the computation is quite expensive for a large input.


Neural Autoregressive Collaborative Filtering for Implicit Feedback

This short paper extends the NADE-CF for implicit feedback dataset.  The main difference between implicit and explicit dataset is that the implicit feedback data does not have a negative sample. For example, the lack of a number of clicks or view counts does not imply that a particular user dislikes that item. He/she may not aware the existing of that item.

The author extent CF-NADE to solve the implicit feedback problem by predicting the likeness probability:

p(\textbf{t}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i})

where M is the number of items, t_i = 1 indicates user u likes item i, t_i = 0 when user u dislikes item i. However, we do not know whether user u will like item i. So we need to convert implicit feedback r_{ui} to t_{ui} by setting t_{ui} when r_{ui} > 0.

Since this binarizing an implicit feedback is noisy, the confidence that user u like or dislike the given item should be included in the model. One way to estimate the confidence is:

c_{ui} = 1 + \alpha r_{ui}

The higher implicit feedback score user u gives the higher confidence.

Thus, the final model becomes:

p(\textbf{t}|\textbf{c}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i}, \textbf{c}_{<i})

This conditional probability is modeled as an autoregressive model with a hidden variable. The confidence will basically act as a weight on the loss function:

C = -\sum_{i=1}^M c_i \log p(t_i|\textbf{t}_{<i}, \textbf{c}_{<i})

Overall, this short paper provides a simple extension of CF-NADE by binarizing an implicit feedback and adding confident random variable to the original CF-NADE model.


A Neural Autoregressive Approach for Collaborative Filtering (ICML’16)

This paper proposed an autoregressive model for rating prediction. The idea of this paper is similar to CF-RBM for collaborative filtering problem. Instead, it replaces an RMB with NADE. Since NADE is a mean-field approximation of RBM and also CF-RBM works for collaborative filtering problem, it makes sense that NADE model is applicable for CF problem.

The NADE model aims to estimate a probability of the given sequence as P(\vec{r}) = \prod_{i=1}^D P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) The previously seen sequence of rating can be used to predict the current rating. This also implies that some early rating on the known sequence has a strong influence on the current rating. This observation made NADE a different model from RNN.

The conditional probability distribution is defined as:

P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) = \frac{\exp(s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}{\sum_{l=1}^K\exp(s^l_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}

The model needs to estimate a score function s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) such that it gives a high score when a user will likely to give a rating k to an item m_{o_i} given her previous rating history.

The score function can be seen as a dot product of a user state with an item latent vector. s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) = b^k_{m_{o_i}} + V^k_{m_{o_i},:}\textbf{h}(r_{m_{o_{<i}}}). The bias is an item popularity and the dot product is the similarity between the current taste of the given user to the item o_i.

The hidden state captures the user’s current interest. It means that this model captures the change of behavior. This hidden state is defined as \textbf{h}(r_{m_{o_{<i}}}) = g(c + \sum_{j<i}W^{r_{m_{o_j}}}_{:,m_{o_j}}) The function g is an activation function, c is a bias, and a matrix W is an interaction matrix that transforms the input rating vector to a item vector.

The NADE-CF can be complicated but the main idea is the same as RBM-CF. Each user will have her own NADE network. NADE-CF shares all weight parameters among all users to avoid overfitting. Further, since each weight matrix associated with a rating, some weight matrix might be updated less frequently than others. The author changes the formula to compute the hidden state (look at the eq. 9 in the original paper). This way, the rare rating matrix will be updated more often.

In order to reduce the number parameters further, the low-rank matrix approximation is applied to matrix W and V. In addition, to improve the predictability further, the ordinal cost is introduced. The idea is if the user rating item m with rating k. Her rating preference of k-1 should be higher than k-2 and so on. In other words, if a user gave a rating of 3, it is likely that she will give a rating of 2 instead of 1 because a rating of 2 is closer to a rating of 3. The ordinal cost then is modeled as a rank loss similar to the list-wise learning to rank model.

Overall, NADE-CF yields a very good performance on the popular CF data set such as Netflix and MovieLens. The model is complicated and seems to be slow. However, with ordinal cost, low-rank approximation, and its ability to have a deeper model, made this model more attractive and it might be worthwhile to investigate an autoregressive model for this particular task.