This paper presents a sequential model that incorporates uncertainty to better capture the variability inherent in the data itself.

The motivation is that data, especially speech signals, exhibit high variability that does not come from noise alone. A basic RNN cannot model the complex relationship between the observed data and the underlying factors of variability. For example, the vocal quality of the speaker affects the audio waveform even when the same word is spoken.

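In a classical RNN, the state transition $h_t = f_\theta(x_t, h_{t-1})$ is a deterministic function, and $f$ is typically an LSTM or GRU. The RNN models the joint probability of the entire sequence as $p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$ with $p(x_t \mid x_{<t}) = g_\tau(h_{t-1})$, where $g$ is an output function that maps the hidden state to a probability distribution over the output. The choice of $g$ depends on the problem. Typically, $g$ has two parts: (1) a parameter generator, $\phi_\tau(h_{t-1})$, and (2) a density function, $p_{\phi_\tau(h_{t-1})}(x_t \mid x_{<t})$. We can also make $g$ a GMM; in that case, $\phi_\tau$ generates the mixture coefficients together with the component means and covariances.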
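
To make the two parts of the output function concrete, here is a minimal NumPy sketch of a deterministic RNN with a diagonal Gaussian output. It is an illustration under stated assumptions, not the paper's implementation: a plain tanh cell stands in for the LSTM/GRU, and the weight names (`W_xh`, `W_hh`, `W_mu`, `W_sigma`) and dimensions are hypothetical, randomly initialized placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 2, 8

# Hypothetical, randomly initialized weights (illustration only, not learned).
W_xh = rng.normal(scale=0.1, size=(h_dim, x_dim))
W_hh = rng.normal(scale=0.1, size=(h_dim, h_dim))
W_mu = rng.normal(scale=0.1, size=(x_dim, h_dim))
W_sigma = rng.normal(scale=0.1, size=(x_dim, h_dim))


def transition(x_t, h_prev):
    """Deterministic transition f; a tanh cell is assumed instead of LSTM/GRU."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)


def parameter_generator(h_prev):
    """Part (1) of g: map the hidden state to Gaussian parameters."""
    mu = W_mu @ h_prev
    sigma = np.exp(W_sigma @ h_prev)  # exp keeps the standard deviation positive
    return mu, sigma


def log_density(x_t, mu, sigma):
    """Part (2) of g: log-density of a diagonal Gaussian evaluated at x_t."""
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                        - 0.5 * ((x_t - mu) / sigma) ** 2))


def sequence_log_likelihood(xs):
    """log p(x_1..x_T) = sum_t log p(x_t | x_<t), starting from h_0 = 0."""
    h = np.zeros(h_dim)
    total = 0.0
    for x_t in xs:
        mu, sigma = parameter_generator(h)  # parameters depend on h_{t-1}
        total += log_density(x_t, mu, sigma)
        h = transition(x_t, h)              # h_t = f(x_t, h_{t-1})
    return total


xs = rng.normal(size=(5, x_dim))
print(sequence_log_likelihood(xs))
```

Note that all randomness in this model sits in the final Gaussian; the hidden state itself is a deterministic function of the inputs.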

The only source of variability in an RNN is the output function $g$. This can be problematic for speech signals because the RNN must map many variants of the input waveform to potentially large variations of the hidden state. This limitation motivates the authors to introduce uncertainty into the RNN.

To turn the RNN into a stochastic model, the authors assume that each data point $x_t$ has a latent variable $z_t$, which is initially drawn from a standard Gaussian distribution. The generative process is as follows:

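- For each step $t = 1$ to $T$:
  - Compute prior parameters: $[\mu_{0,t}, \sigma_{0,t}] = \phi^{\text{prior}}_\tau(h_{t-1})$
  - Draw a prior: $z_t \sim \mathcal{N}(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))$
  - Compute likelihood parameters: $[\mu_{x,t}, \sigma_{x,t}] = \phi^{\text{dec}}_\tau(\phi^{z}_\tau(z_t), h_{t-1})$
  - Draw a sample: $x_t \mid z_t \sim \mathcal{N}(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))$
  - Compute a hidden state: $h_t = f_\theta(\phi^{x}_\tau(x_t), \phi^{z}_\tau(z_t), h_{t-1})$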
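
The steps above can be sketched as a short sampling loop. This is a rough illustration rather than the authors' code: the helper `linear` and the names `prior_mu`, `dec_mu`, `rec`, and so on are hypothetical stand-ins for learned networks, the feature extractors on $x_t$ and $z_t$ are assumed to be the identity, and a tanh cell replaces the LSTM/GRU.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim, h_dim = 2, 3, 8


def linear(in_dim, out_dim):
    """Hypothetical random linear map (no bias) standing in for a learned network."""
    W = rng.normal(scale=0.1, size=(out_dim, in_dim))
    return lambda v: W @ v


prior_mu, prior_log_sigma = linear(h_dim, z_dim), linear(h_dim, z_dim)
dec_mu = linear(z_dim + h_dim, x_dim)
dec_log_sigma = linear(z_dim + h_dim, x_dim)
rec = linear(x_dim + z_dim + h_dim, h_dim)


def generate(T):
    """Sample a sequence of length T by following the generative process."""
    h = np.zeros(h_dim)  # with h_0 = 0 and no biases, the first prior is N(0, I)
    xs, zs = [], []
    for _ in range(T):
        # Compute prior parameters from h_{t-1}.
        mu_0, sigma_0 = prior_mu(h), np.exp(prior_log_sigma(h))
        # Draw the latent variable z_t from the prior.
        z = mu_0 + sigma_0 * rng.standard_normal(z_dim)
        # Compute likelihood parameters from z_t and h_{t-1}.
        zh = np.concatenate([z, h])
        mu_x, sigma_x = dec_mu(zh), np.exp(dec_log_sigma(zh))
        # Draw the observation x_t.
        x = mu_x + sigma_x * rng.standard_normal(x_dim)
        # Compute the next hidden state from x_t, z_t, and h_{t-1}.
        h = np.tanh(rec(np.concatenate([x, z, h])))
        xs.append(x)
        zs.append(z)
    return np.stack(xs), np.stack(zs)


xs, zs = generate(T=4)
print(xs.shape, zs.shape)
```

Because the hidden state is updated with the sampled `z`, two calls to `generate` follow different hidden-state trajectories, which is the point of making the transition stochastic.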

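The state transition function is now stochastic because $z_t$ is a random variable. Also, the hidden state $h_{t-1}$ depends on $x_{<t}$ and $z_{<t}$; therefore, conditioning on $h_{t-1}$ can be replaced with conditioning on the history:

$$z_t \sim p(z_t \mid x_{<t}, z_{<t}), \qquad x_t \sim p(x_t \mid z_{\le t}, x_{<t})$$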

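Thus, the joint distribution becomes:

$$p(x_{\le T}, z_{\le T}) = \prod_{t=1}^{T} p(x_t \mid z_{\le t}, x_{<t}) \, p(z_t \mid x_{<t}, z_{<t})$$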

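The objective is to maximize the log-likelihood of the input sequence, $\log p(x_{\le T})$. By assuming that the approximate posterior factorizes as $q(z_{\le T} \mid x_{\le T}) = \prod_{t=1}^{T} q(z_t \mid x_{\le t}, z_{<t})$, the ELBO is:

$$\mathbb{E}_{q(z_{\le T} \mid x_{\le T})}\left[\sum_{t=1}^{T} \Big( -\mathrm{KL}\big(q(z_t \mid x_{\le t}, z_{<t}) \,\|\, p(z_t \mid x_{<t}, z_{<t})\big) + \log p(x_t \mid z_{\le t}, x_{<t}) \Big)\right]$$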

The ELBO can be optimized efficiently within the variational autoencoder framework. In fact, this model is a sequential version of the classical variational autoencoder.
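
As in a standard variational autoencoder, each time step contributes a reconstruction log-likelihood minus a KL divergence between the approximate posterior and the prior. The sketch below estimates one such term with a single reparameterized sample, assuming all distributions are diagonal Gaussians. The function names and the `decode` callable are hypothetical; in the full model the decoder would also depend on the previous hidden state, for example through a closure.

```python
import numpy as np


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)))."""
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    ))


def gaussian_log_prob(x, mu, sigma):
    """Log-density of a diagonal Gaussian evaluated at x."""
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                        - 0.5 * ((x - mu) / sigma) ** 2))


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)


def elbo_step(x_t, mu_q, sigma_q, mu_p, sigma_p, decode, rng):
    """Single-sample estimate of one time step's ELBO term.

    `decode` is a hypothetical callable mapping z_t to (mu_x, sigma_x).
    """
    z_t = reparameterize(mu_q, sigma_q, rng)   # sample from the posterior
    mu_x, sigma_x = decode(z_t)                # likelihood parameters
    reconstruction = gaussian_log_prob(x_t, mu_x, sigma_x)
    kl = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
    return reconstruction - kl


# Toy usage with a fixed decoder that ignores z_t (illustration only).
rng = np.random.default_rng(0)
toy_decode = lambda z: (np.zeros(2), np.ones(2))
print(elbo_step(np.array([0.3, -0.1]),
                mu_q=np.zeros(3), sigma_q=0.5 * np.ones(3),
                mu_p=np.zeros(3), sigma_p=np.ones(3),
                decode=toy_decode, rng=rng))
```

Summing `elbo_step` over all time steps gives a Monte Carlo estimate of the sequence ELBO. In practice, the same computation would be written in an automatic differentiation framework so that gradients flow through the reparameterized sample.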

References:

Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems. 2015.