Neural Autoregressive Distribution Estimation (JMLR’16)

A generative model estimates P(x) or P(x, z), in contrast to a discriminative model, which estimates the conditional probability P(z|x) directly. Autoregressive models are one of the three popular approaches to deep generative modeling, besides GANs and VAEs.

An autoregressive model factors the joint distribution of an observation into a product of conditional distributions: P(x) = \prod_i P(x_i | x_{<i}). However, implementing this model by approximating each conditional directly is intractable, because we would need a separate set of parameters for every conditional.

NADE [1] proposed a scalable approximation by sharing weight parameters across the conditionals. Parameter sharing reduces the number of free parameters and has a regularizing effect, because the shared weights must accommodate all observations.

For technical details, NADE models the following distribution:

P(x) = \prod_{d=1}^D P(x_{o_d} | x_{o_{<d}}), where o is a permutation of the dimensions and d indexes into it. For example, if x = {x_1, x_2, x_3, x_4} and o = {2, 3, 4, 1}, then x_{o_1} = x_2. Writing the factorization in terms of a permutation makes the notation more general. Given an ordering, the hidden variables are computed as:

\vec h_d = \sigma(W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c), and the observation (a binary random variable) is generated through a sigmoid output:

P(x_{o_d} = 1| \vec x_{o_{<d}}) = \sigma(V_{o_d}\vec h_d + b_{o_d})

NADE's architecture is similar to an RBM's. In fact, [1] shows that NADE corresponds to a mean-field approximation of an RBM (see [1] for details).

Another important property of NADE is that computing the joint distribution P(x) is linear in the number of dimensions of the observation, because we can express W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c recursively:

Define the base case as: \vec a_1 = \vec c

And the recurrence as: \vec a_d = W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c = W_{:, o_{d-1}} x_{o_{d-1}} + \vec a_{d-1}

This means that all the hidden states \vec h_d = \sigma(\vec a_d) can be computed in time linear in D, since each step only adds one column of W.

There are many extensions of the autoregressive model; one of them, CF-NADE, is currently the state of the art in collaborative filtering (CF). Autoregressive models can also handle binary/discrete random variables, which VAEs currently struggle to model, so they are useful for any problem that requires discrete random variables.


[1] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.