Improved Variational Autoencoders for Text Modeling using Dilated Convolutions (ICML’17)

One reason a VAE with an LSTM decoder is less effective than a plain LSTM language model is that the LSTM decoder tends to ignore the conditioning information from the encoder. This paper [1] uses a dilated CNN as the decoder to improve perplexity on held-out data.

Language Model

A language model factorizes the probability of a sequence as:

p(\textbf{x}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1})

An LSTM language model uses this conditional distribution to predict the next word.
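The factorization above can be sketched in a few lines of Python. This is only an illustration: the bigram table and the function names are hypothetical stand-ins for a trained LSTM, which would condition on the whole prefix rather than on the previous word alone.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Chain rule: log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1})."""
    total = 0.0
    for t, token in enumerate(tokens):
        context = tuple(tokens[:t])  # everything before position t
        total += math.log(cond_prob(token, context))
    return total

# Hypothetical toy conditional distribution (a bigram table, not an LSTM).
BIGRAM = {
    (None, "a"): 0.6, (None, "b"): 0.4,
    ("a", "a"): 0.1, ("a", "b"): 0.9,
    ("b", "a"): 0.5, ("b", "b"): 0.5,
}

def toy_cond_prob(token, context):
    prev = context[-1] if context else None
    return BIGRAM[(prev, token)]

# p("a", "b", "a") = 0.6 * 0.9 * 0.5
log_p = sequence_log_prob(["a", "b", "a"], toy_cond_prob)
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.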

By adding a contextual latent random variable \textbf{z} [2], the language model can be expressed as:

p(\textbf{x}, \textbf{z}) = p(\textbf{z}) \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1}, \textbf{z})

The second model is more flexible because it explicitly models the high variation in sequential data. Without careful training, however, the VAE-based language model often degrades to a standard language model, because the decoder learns to ignore the latent variable generated by the encoder.

Dilated CNN

The authors replace the LSTM decoder with a dilated CNN decoder to control the contextual capacity, that is, how much past context is used to predict the current word. When the effective context window of the convolution (determined by kernel size, depth, and dilation) is large, the decoder covers a long context and behaves much like an LSTM. When the window is small, the model behaves more like a bag-of-words model.


Stacking dilated convolution layers is crucial for good performance because it lets the context window grow exponentially with depth. WaveNet [3] uses the same approach.
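To illustrate how stacking enlarges the context window, here is a minimal pure-Python sketch. The kernel size of 2 and the doubling dilation rates (1, 2, 4, 8) are assumptions chosen for illustration in the style of WaveNet; they are not necessarily the configuration used in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions (current + past) seen by a stack of causal dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zeros.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

doubling = [1, 2, 4, 8]                       # assumed dilation schedule
rf_dilated = receptive_field(2, doubling)     # grows exponentially with depth
rf_plain = receptive_field(2, [1, 1, 1, 1])   # grows only linearly with depth

# Empirical check: push a unit impulse through the stack and see how far it spreads.
signal = [1.0] + [0.0] * 19
for d in doubling:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
reach = max(i for i, v in enumerate(signal) if v != 0.0) + 1
```

With four layers, the dilated stack covers 16 positions while the undilated stack covers only 5, which is the sense in which dilation buys context cheaply.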


By replacing the LSTM decoder with a more suitable one, the VAE can now perform well on the language modeling task. Since text sequences do not contain much variation, the improvement may not be obvious; we may see larger gains on more complex sequential data such as speech or audio signals. The experimental results also show that a dilated CNN is a better decoder than an LSTM, but the gains in perplexity and NLL over the standard LSTM language model are still incremental. We hope to see stronger VAE-based language models in the future.


[1] Yang, Zichao, et al. “Improved Variational Autoencoders for Text Modeling using Dilated Convolutions.” arXiv preprint arXiv:1702.08139 (2017).

[2] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).

[3] Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).


Skip-Thought (NIPS’15)

This paper proposes a deep learning model that learns a representation of word sequences. The model is inspired by the skip-gram model for word embeddings: Skip-Thought assumes that a given sentence is related to its preceding and succeeding sentences. The model therefore has two main components, an encoder and a decoder. The encoder uses a GRU to encode a sentence into a fixed-length vector, which is the final hidden state of the GRU. Two decoders then take the encoder output as input and predict each word of the previous and the next sentence, respectively. The encoder output is treated as a bias input to both decoders.

Skip-Thought is an unsupervised model that learns a general-purpose sentence representation, in contrast to earlier work that relies on supervised signals to learn high-quality representations and yields task-specific models.

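Vocabulary expansion is necessary to encode words that were not seen during training. The authors learn a linear transformation from a pretrained word-embedding space to the encoder's word-embedding space, so that any word with a pretrained embedding can be mapped into the encoder's vocabulary.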
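The idea of vocabulary expansion can be sketched with NumPy as follows. This is a sketch under stated assumptions: the embeddings are synthetic, the dimensions are made up, and the linear map is fitted with ordinary least squares on words shared by both vocabularies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50 shared words, 8-dim pretrained space, 5-dim encoder space.
n_shared, d_pre, d_enc = 50, 8, 5

# Synthetic stand-ins for the two embedding tables over the shared vocabulary.
pretrained = rng.normal(size=(n_shared, d_pre))
true_map = rng.normal(size=(d_pre, d_enc))
encoder_emb = pretrained @ true_map  # noise-free, purely for the demo

# Fit W that minimizes ||pretrained @ W - encoder_emb||^2.
W, *_ = np.linalg.lstsq(pretrained, encoder_emb, rcond=None)

# A word unseen by the encoder but present in the pretrained table
# can now be projected into the encoder's embedding space.
unseen_word_vec = rng.normal(size=(d_pre,))
mapped = unseen_word_vec @ W
```

Because the demo data is noise-free, the fitted W recovers the generating map; with real embeddings the fit would only be approximate.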

The experiments use several variants of the model. The encoder can be a uni-directional or a bi-directional GRU, and concatenating the outputs of the two encoders gives better performance than either one alone.

The empirical results on various tasks and benchmarks show that Skip-Thought learns a robust sentence representation that is competitive with supervised models.


A Recurrent Latent Variable Model for Sequential Data (NIPS’15)

This paper presents a sequential model that incorporates uncertainty in order to better model the variability that arises from the data itself.

The motivation is that the data, especially speech signals, has high variability that does not come from noise alone. The complex relationship between the observed data and the underlying factors of variation cannot be captured by a basic RNN. For example, the vocal quality of a speaker affects the audio waveform even when different speakers say the same word.

In a classical RNN, the state transition h_t = f_{\theta}(x_t, h_{t-1}) is a deterministic function, and f_{\theta} is typically an LSTM or a GRU. The RNN models the joint probability of the entire sequence as p(x_1, x_2, \cdots, x_T) = \prod_{t=1}^T p(x_t | x_{<t}) = \prod_{t=1}^T g_{\tau}(h_{t-1}), where g_{\tau} is an output function that maps the hidden state to a probability distribution over the output. The choice of g_{\tau} depends on the problem. Typically, g_{\tau} has two parts: (1) a parameter generator, \phi_t = \varphi_{\tau}(h_{t-1}), and (2) a density function, P_{\phi_t}(x_t | x_{<t}). The density can also be a GMM, in which case \varphi_{\tau} generates the mixture coefficients, means, and covariances.
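To make the two-part output function concrete, here is a minimal sketch of a one-dimensional GMM density in Python. The parameter values are made up for illustration; in the model they would be produced by the parameter generator \varphi_{\tau}(h_{t-1}) at every time step.

```python
import math

def gmm_density(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at x; weights are assumed to sum to 1."""
    density = 0.0
    for w, m, s in zip(weights, means, stds):
        normalizer = s * math.sqrt(2.0 * math.pi)
        density += w * math.exp(-0.5 * ((x - m) / s) ** 2) / normalizer
    return density

# Hypothetical parameters phi_t for one time step (two mixture components).
phi_t = {"weights": [0.3, 0.7], "means": [-1.0, 2.0], "stds": [0.5, 1.0]}
p_x = gmm_density(0.0, **phi_t)
```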

In an RNN, the only source of variability is the output function g_{\tau}. This can be problematic for speech signals because the RNN must map many variants of the input waveform to potentially large variations of the hidden state h_t. This limitation motivates the authors to introduce uncertainty into the RNN.

To turn the RNN into a non-deterministic model, the authors assume that each data point x_t has a latent variable z_t drawn from a Gaussian prior whose parameters depend on the previous hidden state. The generative process is as follows:

  • For each step t = 1 to T
    • Compute prior parameters: [\mu_{0,t}, \text{diag}(\sigma_{0,t})] = \phi_{\tau}^{\text{prior}}(h_{t-1})
    • Draw a latent variable from the prior: z_t \sim N(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))
    • Compute likelihood parameters: [\mu_{x,t}, \sigma_{x,t}] = \phi_{\tau}^{\text{dec}}(\phi_{\tau}^z(z_t), h_{t-1})
    • Draw a sample: x_t | z_t \sim N(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))
    • Compute a hidden state: h_t = f_{\theta}(\phi_{\tau}^x(x_t), \phi_{\tau}^z(z_t), h_{t-1})
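The generative steps above can be sketched with NumPy. This is an illustration under loud assumptions: every network is replaced by a random, untrained affine map, the recurrence is a plain tanh RNN instead of an LSTM or GRU, and all dimensions and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim, h_dim, T = 3, 2, 4, 5  # hypothetical sizes

def linear(in_dim, out_dim):
    """Random, untrained affine map standing in for a learned network."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda v: v @ W + b

prior_net = linear(h_dim, 2 * z_dim)     # phi^prior: h_{t-1} -> prior parameters
phi_z = linear(z_dim, h_dim)             # phi^z: feature extractor for z_t
phi_x = linear(x_dim, h_dim)             # phi^x: feature extractor for x_t
dec_net = linear(2 * h_dim, 2 * x_dim)   # phi^dec: (phi^z(z_t), h_{t-1}) -> likelihood parameters
rnn_net = linear(3 * h_dim, h_dim)       # f_theta: simplified tanh recurrence (assumption)

def split_mean_std(params):
    """Split a parameter vector into a mean and a positive standard deviation."""
    mean, log_std = np.split(params, 2)
    return mean, np.exp(log_std)

h = np.zeros(h_dim)
xs, zs = [], []
for t in range(T):
    mu_0, sigma_0 = split_mean_std(prior_net(h))        # prior parameters
    z = mu_0 + sigma_0 * rng.normal(size=z_dim)         # z_t ~ N(mu_0, diag(sigma_0^2))
    z_feat = phi_z(z)
    mu_x, sigma_x = split_mean_std(dec_net(np.concatenate([z_feat, h])))
    x = mu_x + sigma_x * rng.normal(size=x_dim)         # x_t ~ N(mu_x, diag(sigma_x^2))
    h = np.tanh(rnn_net(np.concatenate([phi_x(x), z_feat, h])))  # next hidden state
    xs.append(x)
    zs.append(z)

xs, zs = np.stack(xs), np.stack(zs)
```

Note that the hidden state is updated from both x_t and z_t, so the randomness in z_t propagates to all later time steps.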

The state transition function is now non-deterministic because z_t is a random variable. Moreover, the hidden state h_{t-1} is a function of x_{<t} and z_{<t}, so the conditioning on h_{t-1} can be written as:

  • z_t \sim p(z_t | x_{<t}, z_{<t})
  • x_t | z_t \sim p(x_t | z_{\le t}, x_{<t})

Thus, the joint distribution becomes:

p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t|z_{\le t}, x_{<t})p(z_t|x_{<t},z_{<t})

The objective is to maximize the log-likelihood of the input sequence, \log p(x_{\le T}) = \log \int p(x_{\le T}, z_{\le T}) dz_{\le T}. Assuming that the approximate posterior factorizes as q(z_{\le T} | x_{\le T}) = \prod_{t=1}^T q(z_t | x_{\le t}, z_{<t}), the ELBO is:

E_{q(z_{\le T}|x_{\le T})}\Big[ \sum_{t=1}^T \big( \log p(x_t|z_{\le t},x_{<t}) - KL(q(z_t|x_{\le t},z_{<t}) \| p(z_t|x_{<t},z_{<t})) \big) \Big]
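When both distributions in the KL term are diagonal Gaussians, the term has a closed form. The prior is Gaussian by the generative process above; treating the approximate posterior as a diagonal Gaussian as well is an assumption of this sketch. The function name and the test values below are hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

# Identical distributions: the KL divergence is zero.
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])

# Unit-variance Gaussians whose means differ by 1: the KL divergence is 1/2.
kl_shift = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
```

In a per-step training loss, this value would be computed at every t and subtracted from the reconstruction log-likelihood.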

The ELBO can be optimized efficiently within the variational autoencoder framework. In fact, this model is a sequential version of the classical variational autoencoder.


Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems. 2015.