A Recurrent Latent Variable Model for Sequential Data (NIPS’15)

This paper presents a sequential model that incorporates uncertainty in order to better capture the variability that arises from the data itself.

The motivation comes from the fact that data, especially speech signals, exhibit high variability that does not come from noise alone. The complex relationship between the observed data and the underlying factors of variability cannot be captured by a basic RNN alone. For example, the vocal quality of a speaker affects the audio waveform even when the speaker says the same word.

In a classical RNN, the state transition h_t = f_{\theta}(x_t, h_{t-1}) is a deterministic function, and typically f_{\theta} is either an LSTM or a GRU. The RNN models the joint probability of the entire sequence as p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T p(x_t | x_{<t}) = \prod_{t=1}^T g_{\tau}(h_{t-1}), where g_{\tau} is an output function that maps the hidden state to the probability distribution of the output. The choice of g_{\tau} depends on the problem. Typically, g has two parts: (1) a parameter generator, \phi_t = \varphi_{\tau}(h_{t-1}), and (2) a density function, P_{\phi_t}(x_t | x_{<t}). We can also make g a GMM, in which case \varphi_{\tau} generates the mixture coefficient parameters.
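
To make the two-part output function concrete, here is a minimal PyTorch sketch with a Gaussian output density; the module and variable names (GaussianRNN, param_gen, ...) are my own illustrations, not the paper's code.

```python
# Minimal sketch: a classical RNN whose output function g has the two parts
# described above -- a parameter generator varphi_tau and a density P_{phi_t}.
import torch
import torch.nn as nn

class GaussianRNN(nn.Module):
    def __init__(self, x_dim, h_dim):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim, h_dim)           # f_theta: deterministic transition
        self.param_gen = nn.Linear(h_dim, 2 * x_dim)  # varphi_tau: h_{t-1} -> phi_t

    def log_likelihood(self, x):
        # x: (T, B, x_dim); returns sum_t log p(x_t | x_{<t}) per sequence
        T, B, _ = x.shape
        h = x.new_zeros(B, self.rnn.hidden_size)
        ll = 0.0
        for t in range(T):
            mu, log_sigma = self.param_gen(h).chunk(2, dim=-1)  # phi_t from h_{t-1}
            dist = torch.distributions.Normal(mu, log_sigma.exp())
            ll = ll + dist.log_prob(x[t]).sum(-1)               # density P_{phi_t}(x_t | x_{<t})
            h = self.rnn(x[t], h)                               # h_t = f_theta(x_t, h_{t-1})
        return ll
```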

The source of variability in an RNN therefore comes from the output function g alone. This can be problematic for speech because the RNN must map many variants of the input waveform onto a potentially large variation of the hidden state h_t . This limitation motivates the authors to introduce uncertainty into the RNN.

In order to turn the RNN into a stochastic model, the authors assume that each data point x_t has a latent variable z_t , drawn from a Gaussian prior whose parameters are computed from the previous hidden state. The generative process is as follows (a code sketch follows the list):

  • For each step t = 1 to T
    • Compute prior parameters: [\mu_{0,t}, \sigma_{0,t}] = \varphi_{\tau}^{\text{prior}}(h_{t-1})
    • Draw a latent variable from the prior: z_t \sim N(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))
    • Compute likelihood parameters: [\mu_{x,t}, \sigma_{x,t}] = \varphi_{\tau}^{\text{dec}}(\varphi_{\tau}^{z}(z_t), h_{t-1})
    • Draw an observation: x_t | z_t \sim N(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))
    • Update the hidden state: h_t = f_{\theta}(\varphi_{\tau}^{x}(x_t), \varphi_{\tau}^{z}(z_t), h_{t-1})
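
The following PyTorch sketch runs this generative loop end to end, assuming single-layer linear feature extractors and a GRU transition; the class and attribute names are illustrative, not the authors' released code.

```python
# Minimal sketch of ancestral sampling from the generative process above.
import torch
import torch.nn as nn

class VRNNGenerator(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim):
        super().__init__()
        self.phi_x = nn.Linear(x_dim, h_dim)        # varphi_tau^x
        self.phi_z = nn.Linear(z_dim, h_dim)        # varphi_tau^z
        self.prior = nn.Linear(h_dim, 2 * z_dim)    # varphi_tau^prior
        self.dec = nn.Linear(2 * h_dim, 2 * x_dim)  # varphi_tau^dec
        self.rnn = nn.GRUCell(2 * h_dim, h_dim)     # f_theta

    def sample(self, T, batch=1):
        h = torch.zeros(batch, self.rnn.hidden_size)
        xs = []
        for _ in range(T):
            # prior parameters from h_{t-1}, then draw z_t
            mu0, logsig0 = self.prior(h).chunk(2, -1)
            z = torch.distributions.Normal(mu0, logsig0.exp()).sample()
            # likelihood parameters from (varphi_tau^z(z_t), h_{t-1}), then draw x_t
            mux, logsigx = self.dec(torch.cat([self.phi_z(z), h], -1)).chunk(2, -1)
            x = torch.distributions.Normal(mux, logsigx.exp()).sample()
            # recurrence: h_t = f_theta(varphi_tau^x(x_t), varphi_tau^z(z_t), h_{t-1})
            h = self.rnn(torch.cat([self.phi_x(x), self.phi_z(z)], -1), h)
            xs.append(x)
        return torch.stack(xs)
```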

The state transition is now stochastic because z_t is a random variable. Moreover, the hidden state h_{t-1} is a deterministic function of x_{<t} and z_{<t} , so conditioning on h_{t-1} is equivalent to conditioning on the histories, and we can write:

  • z_t \sim p(z_t | x_{<t}, z_{<t})
  • x_t | z_t \sim p(x_t | z_{\le t}, x_{<t})

Thus, the joint distribution becomes:

p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t|z_{\le t}, x_{<t})p(z_t|x_{<t},z_{<t})

The objective is to maximize the log-likelihood of the input sequence, p(x_{\le T}) = \int p(x_{\le T}, z_{\le T}) \, dz_{\le T} . By assuming that the approximate posterior factorizes as q(z_{\le T} | x_{\le T}) = \prod_{t=1}^T q(z_t | x_{\le t}, z_{<t}) , the ELBO is:

E_{q(z_{\le T}|x_{\le T})}\big[ \sum_{t=1}^T \log p(x_t|z_{\le t},x_{<t}) - KL(q(z_t|x_{\le t},z_{<t}) || p(z_t|x_{<t},z_{<t})) \big]
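
Since both the approximate posterior and the prior are diagonal Gaussians, with parameters (\mu_{q,t}, \sigma_{q,t}) and (\mu_{0,t}, \sigma_{0,t}) respectively (the posterior notation here is mine), the KL term has the standard closed form

KL(q(z_t|x_{\le t},z_{<t}) \,||\, p(z_t|x_{<t},z_{<t})) = \frac{1}{2}\sum_{j=1}^{d}\left( \log\frac{\sigma_{0,t,j}^2}{\sigma_{q,t,j}^2} + \frac{\sigma_{q,t,j}^2 + (\mu_{q,t,j}-\mu_{0,t,j})^2}{\sigma_{0,t,j}^2} - 1 \right)

where d is the dimensionality of z_t .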

The ELBO can be optimized efficiently within the variational autoencoder framework. In fact, this model is a sequential version of the classical variational autoencoder.
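
As a rough illustration of how training proceeds, the sketch below computes the ELBO with an inference network q(z_t | x_{\le t}, z_{<t}) conditioned on (\varphi_{\tau}^{x}(x_t), h_{t-1}) and the reparameterization trick; all names and architectural choices are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of one ELBO computation for a VRNN-style model.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class VRNN(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim):
        super().__init__()
        self.phi_x = nn.Linear(x_dim, h_dim)        # varphi_tau^x
        self.phi_z = nn.Linear(z_dim, h_dim)        # varphi_tau^z
        self.prior = nn.Linear(h_dim, 2 * z_dim)    # p(z_t | x_{<t}, z_{<t})
        self.enc = nn.Linear(2 * h_dim, 2 * z_dim)  # q(z_t | x_{<=t}, z_{<t})
        self.dec = nn.Linear(2 * h_dim, 2 * x_dim)  # p(x_t | z_{<=t}, x_{<t})
        self.rnn = nn.GRUCell(2 * h_dim, h_dim)     # f_theta

    def elbo(self, x):                              # x: (T, B, x_dim)
        T, B, _ = x.shape
        h = x.new_zeros(B, self.rnn.hidden_size)
        elbo = 0.0
        for t in range(T):
            xt = self.phi_x(x[t])
            # prior and approximate posterior over z_t
            mu0, logsig0 = self.prior(h).chunk(2, -1)
            muq, logsigq = self.enc(torch.cat([xt, h], -1)).chunk(2, -1)
            prior_t = Normal(mu0, logsig0.exp())
            post_t = Normal(muq, logsigq.exp())
            z = post_t.rsample()                    # reparameterization trick
            zt = self.phi_z(z)
            # reconstruction term: log p(x_t | z_{<=t}, x_{<t})
            mux, logsigx = self.dec(torch.cat([zt, h], -1)).chunk(2, -1)
            recon = Normal(mux, logsigx.exp()).log_prob(x[t]).sum(-1)
            # KL term between posterior and prior
            kl = kl_divergence(post_t, prior_t).sum(-1)
            elbo = elbo + recon - kl
            h = self.rnn(torch.cat([xt, zt], -1), h)
        return elbo.mean()

# Usage: maximize the ELBO (i.e. minimize its negative) with any optimizer.
# model = VRNN(x_dim=200, z_dim=16, h_dim=128)
# loss = -model.elbo(x_batch); loss.backward()
```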

References:

Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems. 2015.