DocNADE (JMLR 2016)

This work extends Neural Autoregressive Distribution Estimation (NADE) to document modeling.

NADE

The key idea of NADE is to factor the joint distribution over an observation into conditionals, where each hidden and output vector depends only on the previously seen dimensions:

p(\textbf{v}) = \prod_{i=1}^D p(v_i | \textbf{v}_{<i})

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{c} + W_{:,<i}\textbf{v}_{<i})

Then, the probability of the output is:

p(v_i=1|\textbf{v}_{<i}) = \sigma(b_i + V_{i,:}\textbf{h}_i(\textbf{v}_{<i}))

[Figure: NADE architecture]
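A minimal NumPy sketch of these two equations, assuming a sigmoid activation for g and treating \textbf{v} as a binary vector; the function and variable names follow the formulas above rather than any published code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def nade_log_likelihood(v, W, V, b, c):
        """Evaluate log p(v) for a binary vector v under NADE.

        W: (H, D) input-to-hidden weights, V: (D, H) hidden-to-output weights,
        b: (D,) output biases, c: (H,) hidden biases.
        """
        D = len(v)
        log_p = 0.0
        for i in range(D):
            # Hidden state conditioned on the previously seen dimensions v_{<i}.
            h_i = sigmoid(c + W[:, :i] @ v[:i])
            # Conditional probability p(v_i = 1 | v_{<i}).
            p_i = sigmoid(b[i] + V[i, :] @ h_i)
            log_p += np.log(p_i if v[i] == 1 else 1.0 - p_i)
        return log_p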

NADE has a set of separate hidden layers, each representing a previously seen context. However, NADE is not directly applicable to variable-length inputs such as a sequence of words.

DocNADE

The DocNADE model tackles the variable-length input issue by computing the hidden vector as follows:

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{c} + \sum_{k<i} W_{:,v_k})

Each word v_k is an index into a fixed-size vocabulary, and each column of the matrix W is a word embedding. Hence, the sum of word vectors represents the preceding word context. Note that this does not preserve word order, since the model simply sums all word vectors.

[Figure: DocNADE architecture]
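A rough sketch of this hidden computation in NumPy, assuming W has one embedding column per vocabulary word and using tanh for the activation g; the function name and shapes are illustrative assumptions:

    import numpy as np

    def docnade_hidden(word_indices, i, W, c):
        """h(v_{<i}): sum the embedding columns of the first i words, then apply g."""
        context = W[:, word_indices[:i]].sum(axis=1)  # order-independent bag of embeddings
        return np.tanh(c + context)

Because the context is just a sum, the same hidden vector is obtained for any permutation of the first i words.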

The output layer requires a softmax over the vocabulary to compute each word probability. A hierarchical softmax is necessary to scale up this calculation.
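For intuition, here is a hedged sketch of the hierarchical softmax idea: assuming each vocabulary word is assigned a root-to-leaf path in a binary tree, its probability is a product of per-node sigmoid decisions, so the cost per word is logarithmic in the vocabulary size rather than linear. The data layout below is an assumption for illustration, not the paper's implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hierarchical_softmax_prob(h, path_nodes, path_bits, node_vectors):
        """p(word | h) as a product of binary decisions along the word's tree path.

        path_nodes: indices of internal nodes on the root-to-leaf path.
        path_bits:  the left/right (0/1) decision taken at each of those nodes.
        node_vectors: one weight vector per internal tree node.
        """
        p = 1.0
        for node, bit in zip(path_nodes, path_bits):
            q = sigmoid(node_vectors[node] @ h)   # probability of branching "right"
            p *= q if bit == 1 else 1.0 - q
        return p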

DocNADE Language Model

The previous model may not be suitable for language modeling because it focuses on learning a semantic representation of the whole document. For language modeling, the hidden layer needs to pay more attention to the immediately preceding words. This can be accomplished by adding an n-gram component:

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{b} + \textbf{h}_i^{DN}(\textbf{v}_{<i}) + \textbf{h}_i^{LM}(\textbf{v}_{<i}))

The additional hidden unit \textbf{h}_i^{LM} models an n-gram language model:

\textbf{h}_i^{LM}(\textbf{v}_{<i}) = \sum_{k=1}^{n-1} U_k W_{:,v_{i-k}}^{LM}

The matrix W^{LM} is a word embedding matrix for the n-gram component, and each U_k is a position-specific weight matrix applied to the embedding of the k-th previous word.

[Figure: DocNADE language model architecture]
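A minimal sketch combining the two hidden components, following the equations above; the array shapes (W and W^{LM} with one embedding column per word, each U_k mapping embeddings to the hidden size) and the names are assumptions for illustration:

    import numpy as np

    def docnade_lm_hidden(word_indices, i, W, W_LM, U, b, n):
        """h_i = g(b + h_i^{DN} + h_i^{LM}) for position i, using tanh for g."""
        h_dn = W[:, word_indices[:i]].sum(axis=1)     # h_i^{DN}: order-independent sum
        h_lm = np.zeros(W.shape[0])                   # h_i^{LM}: last n-1 words, order-aware
        for k in range(1, n):
            if i - k >= 0:
                h_lm += U[k - 1] @ W_LM[:, word_indices[i - k]]
        return np.tanh(b + h_dn + h_lm)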

Summary

DocNADE is similar to a Recurrent Neural Network in that both models estimate the conditional probability of the current input given the previous inputs. For language modeling, an RNN is less explicit about how many words of context it looks back over, whereas DocNADE requires us to explicitly specify the number of words to look back. On the other hand, DocNADE has a similar flavor to Word2Vec in that the document representation is simply an aggregate of all previously seen word embeddings; however, DocNADE adds an additional nonlinear transformation on top of the hidden units.

Will this type of autoregressive model fall out of fashion due to the success of recurrent networks with attention mechanisms and memory models? The current trend suggests that RNNs are more flexible and extensible than NADE, so we can expect more development and extensions of RNN models in the coming years.

References:

Lauly, Stanislas, et al. “Document neural autoregressive distribution estimation.” arXiv preprint arXiv:1603.05962 (2016).

 

Neural Autoregressive Distribution Estimation (JMLR’16)

A generative model estimates P(x) or P(x, z), which is different from a discriminative model that estimates the conditional probability P(z|x) directly. The autoregressive model is one of the three popular approaches to deep generative modeling, besides GANs and VAEs. It models P(x) = \prod_i P(x_i | x_{<i}), factoring the joint distribution of an observation into a product of conditional distributions. However, implementing this model by approximating each conditional directly is intractable because we would need separate parameters for every conditional.

NADE [1] proposed a scalable approximation by sharing weight parameters across the conditionals. Sharing parameters reduces the number of free parameters and has a regularizing effect, because the shared weights must accommodate all observations.

For technical details, NADE models the following distribution:

P(x) = \prod_{d=1}^D P(x_{o_d}|x_{o_{<d}}) where d indexes the permutation o. For example, if x = \{x_1, x_2, x_3, x_4\} and o = (2, 3, 4, 1), then x_{o_1} = x_2. Writing the factorization in terms of a permutation of the observation dimensions is the more general notation. Given an ordering, the hidden variables are computed as:

\vec h_d = \sigma(W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c)

and we can generate each observation (a binary random variable) using a sigmoid function:

P(x_{o_d} = 1| \vec x_{o_{<d}}) = \sigma(V_{o_d}\vec h_d + b_{o_d})
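A sketch of ancestral sampling under these two equations, assuming the identity ordering o = (1, ..., D) for readability; the names mirror the formulas and are not from the paper's code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def nade_sample(W, V, b, c, rng=np.random.default_rng()):
        """Draw one binary vector x ~ P(x), one dimension at a time."""
        D = V.shape[0]
        x = np.zeros(D)
        for d in range(D):
            h = sigmoid(c + W[:, :d] @ x[:d])       # h_d depends only on x_{<d}
            p = sigmoid(V[d, :] @ h + b[d])         # P(x_d = 1 | x_{<d})
            x[d] = 1.0 if rng.random() < p else 0.0
        return x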

NADE's architecture is similar to that of an RBM; in fact, [1] shows that NADE can be derived as a mean-field approximation of an RBM (see [1] for details).

Another important property of NADE is that computing the joint distribution P(x) is linear in the number of observation dimensions, because we can express W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c recursively:

Define the base case as: \vec a_1 = \vec c

And the recurrence relation as: \vec a_d = W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c = W_{:,o_{d-1}} x_{o_{d-1}} + \vec a_{d-1}

This means that computing all \vec h_d = \sigma(\vec a_d) can be done in time linear in the number of dimensions.
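A sketch of this linear-time evaluation in NumPy, again assuming the identity ordering; the pre-activation a is updated in place instead of recomputing W_{:, o_{<d}} \vec x_{o_{<d}} from scratch at every step:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def nade_log_likelihood_fast(x, W, V, b, c):
        """log P(x) in O(D*H) time using the recursion a_d = a_{d-1} + W[:, d-1] * x[d-1]."""
        D = len(x)
        a = c.copy()                      # base case: a_1 = c
        log_p = 0.0
        for d in range(D):
            h = sigmoid(a)                # h_d = sigmoid(a_d)
            p = sigmoid(V[d, :] @ h + b[d])
            log_p += np.log(p if x[d] == 1 else 1.0 - p)
            a += W[:, d] * x[d]           # recurrence: fold x_{o_d} into the context
        return log_p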

There are many extensions of the autoregressive model; one of them, CF-NADE, is currently the state of the art in collaborative filtering. This family of models can handle binary/discrete random variables, which VAEs are currently unable to model, so it can be useful for any problem that requires discrete random variables.

References:

[1] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.