Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is the reconstruction error. This term measures how much useful information is compressed into a learned latent vector. The common reconstruction error is based on an element-wise measurement, such as binary cross entropy for a black-and-white image or squared error between a reconstructed image and the input image.

The authors argue that an element-wise measurement is not an accurate indicator of the quality of a learned latent vector. Hence, they propose to learn a similarity metric via adversarial training. Here is how they set up the objective functions:

They use a VAE to learn a latent vector of the input image. A vanilla VAE has two loss terms: a KL loss and a negative log data-likelihood. They replace the second term with a new reconstruction loss, which we will talk about in the next paragraph. Then, they add a discriminator that tries to distinguish the real input data from the data generated by the VAE decoder. The discriminator encourages the VAE to learn a stronger encoder and decoder.

The discriminator can be decomposed into two parts. If it has L + 1 layers, then the first L layers form a transform function that maps the input data into a new representation, and the last layer is a binary classifier. This means that if we pass any input through the first L layers, we get a new representation that is easily classified by the last layer of the discriminator. When the discriminator is very good at detecting real inputs, the representation at layer L is much easier to classify than the raw input data. The new reconstruction loss is the squared error between the transformed input and its transformed reconstruction, which should be small when the two inputs are similar.
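To make the idea concrete, here is a minimal sketch that compares the element-wise error with the error measured in the discriminator's feature space. The tiny one-hidden-layer discriminator, its fixed weights, and the function names are all assumptions made for illustration; they are not the architecture from the paper.

```python
import math

# Hypothetical toy discriminator: one hidden layer (the "first L layers")
# followed by a logistic output unit (the "last layer").
W_HIDDEN = [[0.5, -0.2, 0.1, 0.3],
            [-0.4, 0.6, 0.2, -0.1],
            [0.3, 0.1, -0.5, 0.2]]   # 3 hidden units, 4 input pixels
W_OUT = [0.7, -0.3, 0.5]

def features(x):
    """Representation of x at the last hidden layer of the discriminator."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]

def discriminate(x):
    """Probability that x is real, computed from the hidden features."""
    logit = sum(w * h for w, h in zip(W_OUT, features(x)))
    return 1.0 / (1.0 + math.exp(-logit))

def squared_error(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def pixel_loss(x, x_rec):
    """Element-wise reconstruction error in input space."""
    return squared_error(x, x_rec)

def feature_loss(x, x_rec):
    """Reconstruction error measured in the discriminator's feature space."""
    return squared_error(features(x), features(x_rec))

x = [1.0, 0.0, 1.0, 0.0]
x_rec = [0.9, 0.1, 0.8, 0.0]
print(pixel_loss(x, x_rec), feature_loss(x, x_rec))
```

In a real model the features would come from a trained discriminator, so the two losses could rank reconstructions differently.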

This model is trained in the same fashion as GANs: the VAE and the GAN are trained simultaneously. This idea works well for images because squared error is not a good metric of image quality. It may work on text datasets as well, because we assess the quality of a reconstructed text as a whole rather than one word at a time.



Adversarial Variational Bayes

A variational autoencoder (VAE) requires an expressive inference network in order to learn a complex posterior distribution. A more expressive inference network should result in higher-quality generated data.

This work uses adversarial training to learn a function T(x,z) that approximates \log q_{\phi}(z|x) - \log p(z). The expectation of this term w.r.t. q_{\phi}(z|x) is in fact the KL-divergence term. Since the authors prove that the optimal T^*(x, z) = \log q_{\phi}(z|x) - \log p(z), the ELBO becomes:

E_{p_{D(x)}}E_{q_{\phi}(z|x)}[-T^*(x,z) + \log p_{\theta}(x|z)]
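The claim that the expectation of T^*(x, z) under q_{\phi}(z|x) equals the KL term can be checked numerically. The sketch below assumes a single fixed x, a one-dimensional Gaussian q with hypothetical mean 1.0 and standard deviation 0.5, and a standard normal prior; none of these choices come from the paper.

```python
import math
import random

def log_normal_pdf(z, mean, std):
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2

def t_star(z, q_mean, q_std):
    """Optimal T*(x, z) = log q(z|x) - log p(z), with p(z) = N(0, 1)."""
    return log_normal_pdf(z, q_mean, q_std) - log_normal_pdf(z, 0.0, 1.0)

def kl_gaussian_to_standard(q_mean, q_std):
    """Closed-form KL(N(mean, std^2) || N(0, 1))."""
    return 0.5 * (q_std ** 2 + q_mean ** 2 - 1.0) - math.log(q_std)

def monte_carlo_kl(q_mean, q_std, num_samples=100_000, seed=0):
    """Average T*(x, z) over samples z ~ q(z|x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += t_star(z, q_mean, q_std)
    return total / num_samples

estimate = monte_carlo_kl(1.0, 0.5)
exact = kl_gaussian_to_standard(1.0, 0.5)
print(estimate, exact)  # the two values should be close
```

In the actual method q is implicit, so T has to be learned by a discriminator instead of being written down as it is here.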

In order to approximate T^*(x, z), the discriminator needs to learn to distinguish between a sample from p_{D(x)}p(z), which pairs the data with the prior, and a sample from the current inference model p_{D(x)}q_{\phi}(z|x). Thus, the objective function for the discriminator is set up as:

\max_T E_{p_{D(x)}}E_{q_{\phi}(z|x)} \log \sigma(T(x,z)) + E_{p_{D(x)}}E_{p(z)} \log(1 - \sigma(T(x,z))) (1)

Taking the gradient of T^*(x,z) w.r.t. the parameter \phi can be problematic because T^* is the solution of an optimization problem that itself depends on q_{\phi}(z|x). But the authors show that the expectation of the gradient of T^*(x, z) w.r.t. \phi is 0. Thus, this dependence contributes nothing to the parameter update and can be ignored when differentiating the ELBO.

Since T(x,z) requires a sample z, the reparameterization trick is applied and the ELBO becomes:

E_{p_{D(x)}}E_{\epsilon}[-T^*(x, z_{\phi}(x, \epsilon)) + \log p_{\theta}(x|z_{\phi}(x, \epsilon))] (2)

This step is crucial because the sampling is now just a transformation of noise, and T^*(x, z) is left to approximate the KL-divergence term. This makes the model look like a black box, because we do not explicitly define a distribution q_{\phi}(z|x).
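As a rough illustration of sampling as a transformation of noise, the sketch below draws z by pushing Gaussian noise through a deterministic function of (x, \epsilon). The function z_phi and its three parameters are hypothetical stand-ins for the inference network; the point is only that samples are easy to draw while the density of z is never written down.

```python
import math
import random

def z_phi(x, eps, phi):
    """Hypothetical inference network: a deterministic map of (x, eps)."""
    w_x, w_eps, w_mix = phi
    return math.tanh(w_x * x + w_eps * eps) + w_mix * x * eps

def sample_latents(x, phi, num_samples, seed=0):
    """Draw z = z_phi(x, eps) with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    return [z_phi(x, rng.gauss(0.0, 1.0), phi) for _ in range(num_samples)]

phi = (1.0, 0.8, 0.3)
samples = sample_latents(0.5, phi, num_samples=5)
print(samples)
```

Because all randomness lives in eps, gradients w.r.t. phi can pass through z_phi in an automatic differentiation framework.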

This model optimizes equations (1) and (2) using adversarial training. It optimizes eq. (1) for several steps in order to keep T(x, z) close to optimal while jointly optimizing eq. (2).

An adaptive contrast technique is used to make T(x, z) sufficiently close to the optimum. Basically, the KL term in the ELBO is replaced by KL(q_{\phi}(z|x), r_{\alpha}(z|x)), where r_{\alpha}(z|x) is an auxiliary distribution, which could be a Gaussian distribution.

This model has connections to the variational autoencoder, the adversarial autoencoder, f-GANs, and BiGANs. A new training method for VAEs via adversarial training allows us to use a flexible inference model that approximates the true posterior distribution over the latent vectors.



IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models (SIGIR’17)

This paper uses the GAN framework to combine generative and discriminative information retrieval models. It shows promising results on web search, item recommendation, and Q/A tasks.

Typically, retrieval models are classified into two types:

  • Generative retrieval model
    • It generates a document given a query and possibly a relevance score. The model is p(d|q, r).
  • Discriminative retrieval model
    • It computes a relevance score for the given query and document pair. The model is p(r| d, q).

The generative model tries to find a connection between document and query. On the other hand, the discriminative model attempts to model the interaction between query and document based on relevance scores.

Both models have their shortcomings. Many generative models require a predefined data-generating story, and a wrong assumption will lead to poor performance. The generative model usually tries to fit the data to its model without external guidance. Meanwhile, the discriminative model requires a lot of labeled data to be effective, especially for a deep neural network model.

By training both models within the GAN framework, it is now possible to address their shortcomings. The generative model is now adaptive because the discriminator will reward the generative model when it can create or select good samples. This adaptive guidance from the discriminator is unique to the GAN framework and helps the generator learn to pick good samples from the data distribution. At the same time, the discriminator receives even more training data from the generative model. This is similar to semi-supervised learning, where unlabeled data are utilized.

Adversarial training allows us to improve both generative and discriminative models by jointly learning through minimax training. Traditional training based on maximum likelihood does not offer a principled way for the two models to give each other feedback.

The paper seems promising and the results on the three information retrieval tasks are really good. But I notice that their training procedure requires pretraining. This made me wonder whether pretraining accounts for part of the performance boost at test time. I could not find the part of the paper that explains the benefit of pretraining in their settings.

The discriminative model is straightforward. It is a sigmoid function. The discriminator basically gives a high probability when the given document-query pair is relevant. The generative model is more interesting. In standard GANs, the generator creates a sample from a simple distribution, but IRGAN does not generate a new document-query pair. Instead, the authors chose to let the generator select samples from the document pool. In my opinion, this approach is simpler than creating new data because the sample is realistic. Also, IRGAN cares about finding a function to compute a relevance score, so it is unnecessary to generate completely new data.

However, the cost function for the generator is an expectation over all documents in the corpus, and the generator samples documents from a discrete distribution, so the gradient cannot be pushed through the sampling step directly. Thus, they use policy gradient so that the model can still learn a useful representation. Although p(d|q,r) is a discrete distribution, backprop is applicable because we pre-sample all documents from p(d|q,r) beforehand and take the gradient through the log-probability of each sampled document. Thus, eq. 5 is differentiable. The Monte Carlo approximation has a high variance, so extra care is needed to reduce it further; they use an advantage function. (Please look at the reference on Reinforcement Learning [2].)
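The policy gradient step can be sketched on a toy document pool. Everything below is an assumption for illustration: a softmax generator over four documents, made-up rewards standing in for the discriminator's signal, and a baseline equal to the expected reward. The sampled estimate is compared against the exact gradient of the expected reward.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    """d/d logit_k of sum_d p(d) * reward(d) for a softmax policy."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]

def reinforce_gradient(logits, rewards, num_samples=50_000, seed=0):
    """Monte Carlo policy gradient with an expected-reward baseline."""
    rng = random.Random(seed)
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        d = rng.choices(range(len(logits)), weights=probs)[0]
        advantage = rewards[d] - baseline
        for k in range(len(logits)):
            indicator = 1.0 if k == d else 0.0
            # grad of log p(d) w.r.t. logit k is (indicator - p_k)
            grad[k] += advantage * (indicator - probs[k])
    return [g / num_samples for g in grad]

logits = [2.0, 1.0, 0.5, 0.0]    # hypothetical generator scores per document
rewards = [0.9, 0.2, 0.6, 0.1]   # hypothetical rewards from the discriminator
exact = exact_gradient(logits, rewards)
estimate = reinforce_gradient(logits, rewards)
print(exact)
print(estimate)
```

Subtracting the baseline leaves the expected gradient unchanged, which is why it can be used purely as a variance reduction device.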

Generating positive and negative samples is still confusing in this paper. It seems to be application specific. The authors mention using a softmax with a temperature hyper-parameter to put more or less focus on top documents. My guess is that when we put less focus on top documents, the generator has more chance to pick up negative samples. After I read the paper again, it seems that all samples selected by the generator model are negative samples. This part remains unclear and I need to ask the authors for more details.
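The effect of the temperature can be seen in a small sketch. The scores and the two temperature values are made up for illustration and are not taken from the paper.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn relevance scores into a sampling distribution over documents."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 2.0, 1.0, 0.0]                    # hypothetical scores, best document first
sharp = softmax_with_temperature(scores, 0.5)    # low temperature: focus on top documents
flat = softmax_with_temperature(scores, 5.0)     # high temperature: closer to uniform
print(sharp)
print(flat)
```

With the higher temperature, low-scoring documents are sampled more often, which matches the guess above about picking up more negative samples.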

In conclusion, I like this paper because it tries to combine generative and discriminative retrieval models via the GAN framework. I would not have appreciated it if they had simply applied GANs to the IR task blindly. Instead, they explained their motivation and discussed the advantages of jointly training both models. It seems adversarial training is useful for IR tasks as well.


[1] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, and others. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS.

[3] IRGAN code (


My takes on GANs (Generative Adversarial Nets)

What is the fuss with GANs? Everybody loves GANs now. The joke I heard from my advisor is “If your current model does not work, try it aGAN!” (pun intended). I realize that it comes down to its adversarial objective function. Some people may think it is sexier than the objective functions of VAE and NADE, which just maximize the log-probability, that is, the likelihood the model assigns to the given data samples. For VAE, we maximize the evidence lower bound (ELBO), while NADE maximizes the conditional log-probability. GANs simultaneously maximize and minimize two cost functions.

GANs’ framework has two components: a data generator and a discriminator. The generator attempts to create a sample that is hopefully similar to a sample from the actual data, while the discriminator needs to guess whether the given sample is fake or real. This process can be viewed as a competitive game where the generator tries to fool the discriminator. The game is complete when the discriminator is unable to distinguish a generated sample from a real one. On the other hand, we can also look at GANs as a cooperative game where the discriminator helps the generator create a more realistic sample by providing feedback.

There are 2 cost functions: one for the generator and another for the discriminator. The cost function for the discriminator is a binary cross entropy:

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}E_{x \sim p_{\text{data}}}[\log D(x)] - \frac{1}{2}E_z[\log(1 - D(G(z)))]

There are two sets of mini-batches: one from the real samples and one coming from the generator. The optimal discriminator will predict the probability of being a true sample as \frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}}.

There are many cost functions for a generator. The original paper proposed 2 cost functions. If we construct this learning procedure as a zero-sum game, then the cost function for the generator is simply J^{(G)} = -J^{(D)}. We simply want to minimize the likelihood of the discriminator being correct.

Goodfellow [1] mentioned that the above cost function has a vanishing gradient problem when the discriminator confidently rejects the generated sample as fake. Hence, to overcome this, he proposed a new cost function for the generator that maximizes the likelihood of the discriminator being wrong. The new cost function is:

J^{(G)} = -\frac{1}{2}E_z[\log D(G(z))]
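A small sketch of why the second cost helps: write D(G(z)) = \sigma(a), where a is the discriminator's logit for a generated sample (the logit parametrization is an assumption made here for illustration), and compare the derivatives of the two generator costs w.r.t. a, ignoring the common factor of 1/2.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)): the generator term of the zero-sum cost."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)): the alternative generator cost."""
    return -(1.0 - sigmoid(a))

# a = -10 means the discriminator is almost sure the sample is fake.
for a in (-10.0, 0.0, 10.0):
    print(a, saturating_grad(a), non_saturating_grad(a))
```

When the discriminator is confident that the sample is fake, the first derivative is almost zero while the second stays close to -1, so the generator still receives a learning signal.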

The new cost function is more useful in practice, but the former function is useful for theoretical analysis, which we will discuss next. For now, the objective function is:

\min_G \max_D V(D,G), where V(D,G) = E_{x \sim p_{\text{data}}}[\log D(x)] + E_z[\log(1 - D(G(z)))] = -2J^{(D)}(\theta^{(D)}, \theta^{(G)})

Goodfellow shows that learning the zero-sum game is the same as minimizing the Jensen-Shannon divergence. Given that we have an optimal discriminator, the cost function is now:

C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log D^*(x)] + E_z[\log(1 - D^*(G(z)))]

Since we know the optimal discriminator, we now have:

C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log (\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}})] + E_z[\log(\frac{p_{\text{model}}}{p_{\text{data}} + p_{\text{model}}})]

Then, simplify further:

C(G) = \text{KL}(p_{\text{data}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \text{KL}(p_{\text{model}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \log({\frac{1}{4}})

At the global minimum, we will have p_{\text{data}} = p_{\text{model}}, hence:

C^*(G) =  \log({\frac{1}{4}})
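The derivation above can be checked numerically on small discrete distributions. The two distributions below are arbitrary examples chosen for illustration; the sketch plugs the optimal discriminator into the cost and compares it with the KL form.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def optimal_discriminator(p_data, p_model):
    return [pd / (pd + pm) for pd, pm in zip(p_data, p_model)]

def cost_at_optimal_discriminator(p_data, p_model):
    """C(G) = E_data[log D*(x)] + E_model[log(1 - D*(x))]."""
    d_star = optimal_discriminator(p_data, p_model)
    real_term = sum(pd * math.log(d) for pd, d in zip(p_data, d_star))
    fake_term = sum(pm * math.log(1.0 - d) for pm, d in zip(p_model, d_star))
    return real_term + fake_term

def cost_via_kl(p_data, p_model):
    """The same quantity written as two KL terms plus log(1/4)."""
    mix = [(pd + pm) / 2 for pd, pm in zip(p_data, p_model)]
    return kl(p_data, mix) + kl(p_model, mix) + math.log(0.25)

p_data = [0.5, 0.3, 0.2]
p_model = [0.2, 0.3, 0.5]
print(cost_at_optimal_discriminator(p_data, p_model), cost_via_kl(p_data, p_model))
print(cost_at_optimal_discriminator(p_data, p_data), math.log(0.25))
```

The two KL terms together are twice the Jensen-Shannon divergence, which is zero only when the two distributions match.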

People believed that minimizing the Jensen-Shannon divergence is the reason why GANs can produce sharp images while VAEs produce blurry images, since a VAE minimizes the KL divergence. However, this belief is no longer considered accurate [2].

However, GANs have received a lot of criticism. For instance, Schmidhuber mentioned that Goodfellow should have cited Schmidhuber’s 1992 paper [3]. (There is a Reddit thread on the public exchange during the NIPS 2016 tutorial.) Interestingly, Schmidhuber was one of the reviewers of the original GANs paper and he rejected it.

While many researchers claim that GANs generate realistic samples, a few NLP researchers do not believe so [4]. Meanwhile, standard datasets for particular tasks are necessary because many machine learning research papers depend heavily on empirical results.


[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Goodfellow, Ian. “NIPS 2016 Tutorial: Generative Adversarial Networks.” arXiv preprint arXiv:1701.00160 (2016).




Neural Autoregressive Distribution Estimation (NADE)

The generative model estimates P(x) or P(x,z), which is different from the discriminative model, which estimates a conditional probability P(z|x) directly. An autoregressive model is one of the three popular approaches in deep generative models, besides GANs and VAE.

It models P(x) = \prod_i P(x_i | x_{<i}), that is, it factors the joint distribution of an observation into a product of conditional probability distributions. However, implementing this model by representing each conditional directly is intractable because we need separate parameters for each conditional.

NADE [1] proposes a scalable approximation by sharing weight parameters across the conditionals. Sharing parameters reduces the number of free parameters and has a regularizing effect because the weight parameters must accommodate all observations.

For technical details, NADE models the following distribution:

P(x) = \prod_{d=1}^D P(x_{o_d}|x_{o_{<d}}), where o is a permutation of the dimensions and d is an index into it. For example, if we have x = {x_1, x_2, x_3, x_4}, and o = {2, 3, 4, 1}, then x_{o_1} = x_2. Writing the model in terms of a permutation makes the notation more generic. Once we model the observation, the hidden variables can be computed as:

\vec h_d = \sigma(W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c)

And we can generate the observation (a binary random variable) using a sigmoid function:

P(x_{o_d} = 1| \vec x_{o_{<d}}) = \sigma(V_{o_d}\vec h_d + b_{o_d})

If we look at NADE’s architecture, it is similar to an RBM. In fact, [1] shows that NADE can be derived from a mean-field approximation of an RBM (see [1] for details).

Another important property of NADE is that computing the joint distribution P(x) is linear in the number of dimensions of the observation, because we can express W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c recursively:

Define the base case as: \vec a_1 = \vec c

And the recurrence relation as: \vec a_d =  W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c = W_{:,o_{d-1}} x_{o_{d-1}} + \vec a_{d-1}

This means that computing \vec h_d = \sigma(\vec a_d) can be done in a linear fashion.
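The recursion can be sketched in a few lines. The sizes (D = 4, H = 3), the randomly drawn parameters, and the function name are assumptions for illustration; the ordering is the o = {2, 3, 4, 1} example from above written with 0-based indices. Summing P(x) over all binary vectors serves as a sanity check that the model is normalized.

```python
import itertools
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def nade_log_prob(x, order, W, V, b, c):
    """log P(x) using a_d = a_{d-1} + W[:, o_{d-1}] * x_{o_{d-1}}, with a_1 = c."""
    a = list(c)  # base case: a_1 = c
    log_prob = 0.0
    for i in order:
        h = [sigmoid(a_k) for a_k in a]
        p_one = sigmoid(sum(V[i][k] * h[k] for k in range(len(h))) + b[i])
        log_prob += math.log(p_one if x[i] == 1 else 1.0 - p_one)
        # Fold the dimension just processed into the running pre-activation.
        a = [a[k] + W[k][i] * x[i] for k in range(len(a))]
    return log_prob

rng = random.Random(0)
D, H = 4, 3
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(H)]  # H x D
V = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(D)]  # D x H
b = [rng.uniform(-1, 1) for _ in range(D)]
c = [rng.uniform(-1, 1) for _ in range(H)]
order = [1, 2, 3, 0]  # o = {2, 3, 4, 1} with 0-based indices

total = sum(math.exp(nade_log_prob(x, order, W, V, b, c))
            for x in itertools.product([0, 1], repeat=D))
print(total)  # should be very close to 1
```

Each step updates the pre-activation in O(H) time, so one evaluation of P(x) costs O(D * H) instead of recomputing the full matrix product for every dimension.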

There are many extensions of the autoregressive model. One of them, CF-NADE, is currently the state of the art in collaborative filtering (CF). This model can handle binary/discrete random variables, which VAE is currently unable to model. So it can be useful for any problem that requires a discrete random variable.


[1] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.

Restricted Boltzmann Machine (RBM)

The classical works in generative modeling that have a connection to artificial neural networks are the Boltzmann Machine [1] and the Restricted Boltzmann Machine (RBM) [2]. The deep version of RBM [3] shows many successes in pretraining the parameters of neural networks. There are many good tutorials on RBM and source code that we can find online. My goal for this post is to point out some important aspects of RBM as my review material before posting about more complicated models such as NADE [4].

My favorite tutorial on RBM is by Fischer and Igel [5]. It covers many key ideas for modeling and training RBM. My post is based on this tutorial paper.

What I understand about the RBM is that we are modeling a joint probability of visible and hidden units, p(\bf v, \bf h). RBM removes all connections between hidden units, which results in efficient learning because all hidden units are conditionally independent given the visible units.


A graphical model of RBM. Blue nodes are visible units and Yellow nodes are hidden units. We want to infer hidden units from visible units.

One way to define the joint probability is to use the Gibbs distribution, which defines p(\bf v, \bf h) = \frac{1}{Z} e^{-E(\bf v, \bf h)}, where Z is a normalization constant or partition function, Z = \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. It is important to realize that computing the normalization constant is intractable because we have to enumerate all possible \bf v and \bf h.

The log-likelihood is \log L(\bf \theta | \bf v) =  \log p(\bf v; \bf \theta) = \log \frac{1}{Z} \sum_{\bf h} e^{-E(\bf v, \bf h)} =\log \sum_{\bf h} e^{-E(\bf v, \bf h)}  -\log \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. Once we take the derivative w.r.t. the model parameter \bf \theta, the gradient is:

\frac{\partial \log L(\bf \theta | \bf v)}{\partial \bf \theta} = - \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta} + \sum_{\bf v, \bf h} p(\bf v, \bf h) \frac{\partial E(\bf v, \bf h)}{\partial \theta}

This expression shows that the gradient is the difference between the expected energy gradient under the model distribution p(\bf v, \bf h) and under the posterior distribution p(\bf h | \bf v). Directly evaluating the summation over all \bf v and \bf h is intractable.

RBM defines the energy function as E(\bf v, \bf h) = -\bf h^T W \bf v - \bf b^T \bf v - \bf c^T \bf h. We can derive the gradient w.r.t. w_{ij}. First, we derive -\sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial w_{ij}} = \sum_{\bf h} p(\bf h | \bf v) h_i v_j = p(H_i = 1|\bf v)v_j = \sigma( \sum_{j=1}^m w_{ij}v_j + c_i)v_j. Then, the gradient becomes:

 \frac{\partial \log L(\bf \theta | \bf v)}{\partial w_{ij}} = p(H_i=1|\bf v)v_j - \sum_{\bf v} p(\bf v)p(H_i = 1|\bf v)v_j

We can also show that p(H_i=1| \bf v) is a sigmoid function. A similar derivation can be applied for p(V_j =1|\bf h) = \sigma(\sum_{i=1}^n w_{ij}h_i + b_j). Due to these facts, RBM can be interpreted as a stochastic neural network where the probability of a hidden node being one is dictated by p(H_i =1 | \bf v).
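For a tiny RBM the partition function can still be enumerated, which gives a way to check the sigmoid form of p(H_i = 1 | \bf v) against the joint distribution. The sizes (3 visible, 2 hidden units) and the weights below are made up for illustration.

```python
import itertools
import math

W = [[0.5, -0.3, 0.2],
     [-0.4, 0.1, 0.6]]     # W[i][j]: hidden unit i, visible unit j
b = [0.1, -0.2, 0.3]       # visible biases
c = [-0.1, 0.2]            # hidden biases

def energy(v, h):
    """E(v, h) = -h^T W v - b^T v - c^T h."""
    interaction = sum(h[i] * W[i][j] * v[j]
                      for i in range(len(h)) for j in range(len(v)))
    visible_bias = sum(bj * vj for bj, vj in zip(b, v))
    hidden_bias = sum(ci * hi for ci, hi in zip(c, h))
    return -interaction - visible_bias - hidden_bias

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

visible_states = list(itertools.product([0, 1], repeat=3))
hidden_states = list(itertools.product([0, 1], repeat=2))

# Partition function by enumerating every (v, h) configuration.
Z = sum(math.exp(-energy(v, h)) for v in visible_states for h in hidden_states)

def brute_force_hidden_conditional(i, v):
    """p(H_i = 1 | v) computed directly from the unnormalized joint."""
    numerator = sum(math.exp(-energy(v, h)) for h in hidden_states if h[i] == 1)
    denominator = sum(math.exp(-energy(v, h)) for h in hidden_states)
    return numerator / denominator

def sigmoid_hidden_conditional(i, v):
    """p(H_i = 1 | v) from the closed-form sigmoid expression."""
    return sigmoid(sum(W[i][j] * v[j] for j in range(len(v))) + c[i])

v = (1, 0, 1)
print(brute_force_hidden_conditional(0, v), sigmoid_hidden_conditional(0, v))
```

The enumeration has 2^(n+m) terms, which is why this check is only possible at toy sizes.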

Training RBM is tricky because we can’t directly evaluate the expectation term. One common approximation is to use Gibbs sampling to draw hidden and visible units from p(\bf v, \bf h). Then, we can approximate the second term. However, running Gibbs sampling requires many iterations in order to reach the stationary distribution, so Hinton proposed contrastive divergence, which runs Gibbs sampling for only k steps.
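A minimal sketch of the contrastive divergence estimate of the weight gradient for a single training vector follows. The helper names, the toy sizes, and the choice to use hidden probabilities rather than sampled hidden states in the two gradient terms are assumptions for illustration, not a reference implementation.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_probs(v, W, c):
    """p(H_i = 1 | v) for every hidden unit i."""
    return [sigmoid(sum(W[i][j] * v[j] for j in range(len(v))) + c[i])
            for i in range(len(c))]

def visible_probs(h, W, b):
    """p(V_j = 1 | h) for every visible unit j."""
    return [sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + b[j])
            for j in range(len(b))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_weight_gradient(v0, W, b, c, k=1, rng=None):
    """Approximate d log L / d w_ij using k Gibbs steps started at v0."""
    rng = rng or random.Random(0)
    v = list(v0)
    for _ in range(k):
        h = sample_bernoulli(hidden_probs(v, W, c), rng)
        v = sample_bernoulli(visible_probs(h, W, b), rng)
    positive = hidden_probs(v0, W, c)  # data-dependent term
    negative = hidden_probs(v, W, c)   # term from the k-step sample
    return [[positive[i] * v0[j] - negative[i] * v[j] for j in range(len(v0))]
            for i in range(len(c))]

W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # 2 hidden units, 3 visible units
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
grad = cd_k_weight_gradient([1, 0, 1], W, b, c, k=1)
print(grad)
```

The k-step sample replaces the intractable sum over all visible vectors in the second term of the gradient; a full training loop would average this estimate over a mini-batch and add the analogous bias updates.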

RBM is the classical stochastic neural network that models a joint distribution between visible and hidden units. Training RBM is tricky and requires an approximation of the second log-likelihood term. Modern models such as NADE and VAE turn the stochastic layers of a neural network into a deterministic system, so training can be done through standard backpropagation. This might be one of the reasons why NADE and VAE are more popular than RBM.


[1] Hinton, Geoffrey E., and Terrence J. Sejnowski. “Optimal perceptual inference.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. IEEE New York, 1983.

[2] Hinton, Geoffrey E. “Training products of experts by minimizing contrastive divergence.” Neural computation 14.8 (2002): 1771-1800.

[3] Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Deep Boltzmann Machines.” AISTATS. Vol. 1. 2009.

[4] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.

[5] Fischer, Asja, and Christian Igel. “An introduction to restricted Boltzmann machines.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (2012): 14-36.