# Inference network: RNNs vs NNets

The standard variational autoencoder [1] uses neural networks to approximate the true posterior, mapping an input to the mean and variance of a diagonal Gaussian distribution. A simple modification is to replace the feed-forward inference network with an RNN, which is exactly what this paper presents [2].

Intuitively, an RNN should work well on datasets where consecutive features are highly correlated. For a public dataset such as MNIST, an RNN should therefore have no problem approximating the posterior distribution of any digit.

I started with a classical VAE, trained on the MNIST dataset with 500 hidden units in both the encoder and the decoder. I set the latent dimension to 2 so that I could quickly visualize the embeddings on a 2D plot.
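The setup above can be sketched as a single VAE forward pass. This is a minimal NumPy sketch, not the code used for the experiments: the weight initialization, batch size, and Bernoulli decoder are my own illustrative choices; only the dimensions (784-d input, 500 hidden units, 2-d latent space) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the experiment described above; everything else is illustrative.
D_IN, D_HID, D_Z = 784, 500, 2

def init(shape):
    return rng.normal(scale=0.01, size=shape)

# Encoder: one hidden layer, then separate heads for the mean and log-variance.
W_h, b_h = init((D_IN, D_HID)), np.zeros(D_HID)
W_mu, b_mu = init((D_HID, D_Z)), np.zeros(D_Z)
W_lv, b_lv = init((D_HID, D_Z)), np.zeros(D_Z)

# Decoder: latent -> hidden -> pixel probabilities (Bernoulli likelihood).
W_d, b_d = init((D_Z, D_HID)), np.zeros(D_HID)
W_o, b_o = init((D_HID, D_IN)), np.zeros(D_IN)

def encode(x):
    h = np.tanh(x @ W_h + b_h)
    return h @ W_mu + b_mu, h @ W_lv + b_lv  # mu, log sigma^2

def reparameterize(mu, logvar):
    # z = mu + sigma * eps: the reparameterization trick keeps the
    # sample differentiable with respect to mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    h = np.tanh(z @ W_d + b_d)
    return 1.0 / (1.0 + np.exp(-(h @ W_o + b_o)))  # pixel probabilities

x = rng.random((8, D_IN))          # stand-in batch of 8 "images"
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
print(mu.shape, z.shape, x_hat.shape)  # (8, 2) (8, 2) (8, 784)
```

The 2-d `z` produced by the encoder is exactly what gets scattered on the plots below; swapping the feed-forward `encode` for a recurrent cell is the only change the paper proposes.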

2D embedding using Neural Nets (2-layers) as inference network

Some digits are clustered together, but others are mixed because the VAE does not know the digit labels. It still puts similar digits nearby: the 7's are right next to the 9's, and many 3's and 2's are mixed together. To obtain a better separation between digit classes, the label information should be utilized. In fact, our recent SIGIR'17 publication utilizes label information to cluster similar documents together.

But let's come back to our original research question: will an RNN really improve the quality of the embedding vectors?

2D embedding using LSTM as inference network

The 2D plot above shows that using an LSTM as the inference network yields a slightly different embedding space.

2D embedding vectors of randomly chosen MNIST digits using GRU as inference network

The LSTM and GRU also generate slightly different embedding vectors. The recurrent models tend to spread out each digit class; for example, the 6's (orange) are spread out. All models mix 4's and 9's together. Mixing these digits might not be a bad thing, because some handwritten 4's look very similar to 9's; this probably indicates that the recurrent models can capture more subtle similarities between digits.

Now, let's see whether the RNN models generate better-looking digits than the standard model.

Digits generated with GRU as inference network

Digits generated with LSTM as inference network

Digits generated with neural nets as inference network

It is difficult to tell which model is better. In terms of training time, the neural nets are the fastest and the LSTM is the slowest. It could be that we have not utilized the strength of RNNs yet: since we are working on the MNIST dataset, it might simply be easy for a traditional model (neural nets) to perform well. What if we trained the models on a text dataset such as Newsgroup20? Intuitively, an RNN should be able to capture the sequential information there, and we might get a better embedding space. Next time we will investigate further on text datasets.

References:

[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).

# A Recurrent Latent Variable Model for Sequential Data (NIPS’15)

This paper presents a sequential model that incorporates uncertainty in order to better capture the variability that arises from the data itself.

The motivation comes from the fact that data, especially speech signals, has high variability that does not come from noise alone. The complex relationship between the observed data and the underlying factors of variability cannot be modeled by a basic RNN alone. For example, the vocal quality of the speaker affects the audio waveform even when the speaker says the same word.

In a classical RNN, the state transition $h_t = f_{\theta}(x_t, h_{t-1})$ is a deterministic function, and typically $f_{\theta}$ is an LSTM or GRU. The RNN models the joint probability of the entire sequence as $p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T p(x_t | x_{<t})$ with $p(x_t | x_{<t}) = g_{\tau}(h_{t-1})$, where $g_{\tau}$ is an output function that maps the hidden state to a probability distribution over the output. The choice of $g_{\tau}$ depends on the problem. Typically, $g_{\tau}$ has two parts: (1) a parameter generator, $\phi_t = \varphi_{\tau}(h_{t-1})$, and (2) a density function, $p_{\phi_t}(x_t | x_{<t})$. We can also make $g_{\tau}$ a Gaussian mixture model, in which case $\varphi_{\tau}$ also generates the mixture coefficients.
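The two-part output function can be sketched concretely. This is a toy NumPy illustration with a vanilla tanh cell standing in for $f_{\theta}$ (the paper uses LSTM/GRU) and a Bernoulli density standing in for $g_{\tau}$; the sizes and weights are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

D_X, D_H = 4, 8  # toy input and hidden sizes (illustrative only)

W_xh = rng.normal(scale=0.1, size=(D_X, D_H))
W_hh = rng.normal(scale=0.1, size=(D_H, D_H))
W_hy = rng.normal(scale=0.1, size=(D_H, D_X))

def f_theta(x_t, h_prev):
    # Deterministic state transition (vanilla RNN cell standing in
    # for the LSTM/GRU used in practice).
    return np.tanh(x_t @ W_xh + h_prev @ W_hh)

def g_tau(h_prev):
    # Output function: the parameter generator varphi maps h_{t-1}
    # to Bernoulli parameters phi_t for the next observation.
    return 1.0 / (1.0 + np.exp(-(h_prev @ W_hy)))

def sequence_log_prob(xs):
    # log p(x_1, ..., x_T) = sum_t log p(x_t | x_{<t}),
    # each factor parameterized from the previous hidden state.
    h = np.zeros(D_H)
    total = 0.0
    for x_t in xs:
        phi = g_tau(h)  # density parameters for this step
        total += np.sum(x_t * np.log(phi) + (1 - x_t) * np.log(1 - phi))
        h = f_theta(x_t, h)
    return total

xs = (rng.random((5, D_X)) > 0.5).astype(float)  # toy binary sequence
log_p = sequence_log_prob(xs)
print(log_p)
```

The key point is that the only randomness lives in the density $p_{\phi_t}$; the hidden-state path is fully deterministic given the inputs.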

The only source of variability in the RNN comes from the output function $g_{\tau}$. This can be problematic for speech signals, because the RNN must map many variants of the input waveform onto a potentially large variation of the hidden state $h_t$. This limitation motivates the authors to introduce uncertainty into the RNN.

In order to turn the RNN into a stochastic model, the authors assume that each data point $x_t$ has a latent variable $z_t$, initially drawn from a standard Gaussian distribution. The generative process is as follows:

• For each time step $t = 1, \ldots, T$:
• Compute prior parameters: $[\mu_{0,t}, \sigma_{0,t}] = \phi_{\tau}^{\text{prior}}(h_{t-1})$
• Draw a latent variable from the prior: $z_t \sim N(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))$
• Compute likelihood parameters: $[\mu_{x,t}, \sigma_{x,t}] = \phi_{\tau}^{\text{dec}}(\phi_{\tau}^z(z_t), h_{t-1})$
• Draw a sample: $x_t | z_t \sim N(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))$
• Compute the hidden state: $h_t = f_{\theta}(\phi_{\tau}^x(x_t), \phi_{\tau}^z(z_t), h_{t-1})$

The state transition function is now a stochastic function because $z_t$ is a random variable. Also, the hidden state $h_t$ is a function of $x_{\le t}$ and $z_{\le t}$; therefore, we can unroll the recurrence and replace $h_t$ to write:

• $z_t \sim p(z_t | x_{<t}, z_{<t})$
• $x_t | z_t \sim p(x_t | z_{\le t}, x_{<t})$

Thus, the joint distribution becomes:

$p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t | z_{\le t}, x_{<t}) \, p(z_t | x_{<t}, z_{<t})$

The objective is to maximize the log-likelihood of the input sequence, $p(x_{\le T}) = \int_z p(x_{\le T}, z_{\le T}) \, dz$. By assuming that the approximate posterior distribution factorizes as $q(z_{\le T} | x_{\le T}) = \prod_{t=1}^T q(z_t | x_{\le t}, z_{<t})$, the ELBO is:

$E_{q(z_{\le T}|x_{\le T})}\Big[ \sum_{t=1}^T \log p(x_t | z_{\le t}, x_{<t}) - \text{KL}\big(q(z_t | x_{\le t}, z_{<t}) \,\|\, p(z_t | x_{<t}, z_{<t})\big) \Big]$
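Since both the approximate posterior and the learned prior are diagonal Gaussians here, the per-step KL penalty in the ELBO has a closed form. A small NumPy sketch (the example means and log-variances are arbitrary):

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL(q || p) between two diagonal Gaussians:
    # 0.5 * sum( log(sp^2/sq^2) + (sq^2 + (mq - mp)^2) / sp^2 - 1 )
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

mu = np.array([0.3, -0.1])
lv = np.array([0.0, -1.0])

kl_same = kl_diag_gauss(mu, lv, mu, lv)          # identical distributions
kl_std = kl_diag_gauss(mu, lv, np.zeros(2), np.zeros(2))  # vs. N(0, I)
print(kl_same, kl_std)  # 0.0 and a positive value
```

At each time step this KL is evaluated between $q(z_t | x_{\le t}, z_{<t})$ and the learned prior $p(z_t | x_{<t}, z_{<t})$, then summed over $t$ and subtracted from the reconstruction term.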

The ELBO can be optimized efficiently within the variational autoencoder framework. In fact, this model is a sequential version of the classical variational autoencoder.

References:

Chung, Junyoung, et al. "A recurrent latent variable model for sequential data." Advances in Neural Information Processing Systems. 2015.