My takes on GANs (Generative Adversarial Nets)

What is the fuss about GANs? Everybody loves GANs now. The joke I heard from my advisor is “If your current model does not work, try it aGAN!” (pun intended). I think the appeal comes down to the adversarial objective function. Some people may find it sexier than the objectives of VAE and NADE, which simply maximize the log-probability, i.e. the likelihood that the model assigns to the given data samples. VAE maximizes the evidence lower bound (ELBO), while NADE maximizes the log-likelihood factorized into a product of conditional probabilities. GANs instead train two networks against each other: each network minimizes its own cost function, and in the zero-sum version one network maximizes the same value function that the other minimizes.

The GAN framework has two components: a data generator and a discriminator. The generator attempts to create samples that are hopefully similar to samples from the actual data, while the discriminator needs to guess whether a given sample is fake or real. This process can be viewed as a competitive game in which the generator tries to fool the discriminator. The game is complete when the discriminator is unable to distinguish a generated sample from a real one. On the other hand, we can also look at GANs as a cooperative game in which the discriminator helps the generator create more realistic samples by providing feedback.

There are two cost functions: one for the generator and another for the discriminator. The cost function for the discriminator is a binary cross-entropy:

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}E_{x \sim p_{\text{data}}}[\log D(x)] - \frac{1}{2}E_z[\log(1 - D(G(z)))]

There are two sets of mini-batches: one of real samples and one of samples coming from the generator. For a fixed generator, the optimal discriminator predicts the probability that x is a real sample as D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}.
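As a sanity check on the optimal discriminator, here is a minimal Python sketch. It assumes a toy setting that is not from the paper: x takes only three values, so both expectations in J^{(D)} can be computed exactly instead of being estimated from mini-batches. The function name `discriminator_cost` and the probabilities are made up for illustration.

```python
import math

def discriminator_cost(d, p_data, p_model):
    """Exact J^(D) on a discrete space, where d[i] = D(x_i)."""
    real_term = sum(p * math.log(di) for p, di in zip(p_data, d))
    fake_term = sum(q * math.log(1.0 - di) for q, di in zip(p_model, d))
    return -0.5 * real_term - 0.5 * fake_term

# Hypothetical distributions over three outcomes (illustration only).
p_data = [0.5, 0.3, 0.2]
p_model = [0.2, 0.3, 0.5]

# Closed-form optimal discriminator: p_data / (p_data + p_model).
d_star = [p / (p + q) for p, q in zip(p_data, p_model)]
cost_star = discriminator_cost(d_star, p_data, p_model)

# Nudge each output of D* up or down and record the resulting cost.
perturbed_costs = []
for eps in (-0.1, 0.1):
    for i in range(len(d_star)):
        d = list(d_star)
        d[i] += eps
        perturbed_costs.append(discriminator_cost(d, p_data, p_model))

print("cost at D*:", cost_star)
print("smallest perturbed cost:", min(perturbed_costs))
```

In this toy case every perturbation of D^* increases the cost, which is what we expect if D^* really minimizes J^{(D)}.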

There are many possible cost functions for the generator. The original paper proposed two. If we set up the learning procedure as a zero-sum game, then the cost function for the generator is simply J^{(G)} = -J^{(D)}: the generator minimizes the log-probability of the discriminator being correct.

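Goodfellow et al. [1] pointed out that the above cost function suffers from vanishing gradients when the discriminator confidently rejects the generated samples, which is typical early in training. To overcome this, they proposed a new cost function for the generator: instead of minimizing the log-probability of the discriminator being correct, the generator maximizes the log-probability of the discriminator being wrong. The new cost function is: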

J^{(G)} = -\frac{1}{2}E_z[\log D(G(z))]
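To see the vanishing-gradient issue concretely, here is a small sketch. It assumes the discriminator output is a sigmoid of a logit a, i.e. D(G(z)) = sigmoid(a), which is a common choice but not something stated above. It compares the gradient of each generator cost with respect to a for a single generated sample; the function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_zero_sum(a):
    # Zero-sum generator cost for one sample: 0.5 * log(1 - sigmoid(a)).
    # Its derivative with respect to a is -0.5 * sigmoid(a).
    return -0.5 * sigmoid(a)

def grad_non_saturating(a):
    # New generator cost for one sample: -0.5 * log(sigmoid(a)).
    # Its derivative with respect to a is -0.5 * (1 - sigmoid(a)).
    return -0.5 * (1.0 - sigmoid(a))

# A very negative logit means the discriminator is confident the sample is fake.
for a in (-10.0, -2.0, 0.0, 2.0):
    print(a, grad_zero_sum(a), grad_non_saturating(a))
```

When a is very negative, the zero-sum gradient is close to zero, so the generator gets almost no signal exactly when it needs it most, while the gradient of the new cost stays close to -0.5.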

The new cost function is more useful in practice, but the former is useful for theoretical analysis, which we discuss next. For the zero-sum game, the objective is:

\min_G \max_D V(D,G), \quad V(D,G) = E_{x \sim p_{\text{data}}}[\log D(x)] + E_z[\log(1 - D(G(z)))] = -2 J^{(D)}(\theta^{(D)}, \theta^{(G)})

Goodfellow et al. [1] show that, with an optimal discriminator, solving the zero-sum game is the same as minimizing the Jensen-Shannon divergence between the data distribution and the model distribution. Given the optimal discriminator D^*, the generator's cost is now:

C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log D^*(x)] + E_z[\log(1 - D^*(G(z)))]

Since we know the optimal discriminator, and G(z) is distributed according to p_{\text{model}}, we now have:

C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log (\frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)})] + E_{x \sim p_{\text{model}}}[\log(\frac{p_{\text{model}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)})]

Then, simplify further:

C(G) = \text{KL}(p_{\text{data}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \text{KL}(p_{\text{model}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \log({\frac{1}{4}})
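The simplification skips a step, so here is how I would spell it out. This is a sketch of the standard argument, writing the second expectation over x \sim p_{\text{model}} (the distribution of G(z)) and using JSD for the Jensen-Shannon divergence: multiply and divide inside each logarithm by 2, then pull out the two \log\frac{1}{2} terms.

```latex
\begin{aligned}
C(G) &= E_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}\right]
      + E_{x \sim p_{\text{model}}}\left[\log \frac{p_{\text{model}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}\right] \\
     &= E_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{\frac{1}{2}\left(p_{\text{data}}(x) + p_{\text{model}}(x)\right)}\right] + \log\frac{1}{2}
      + E_{x \sim p_{\text{model}}}\left[\log \frac{p_{\text{model}}(x)}{\frac{1}{2}\left(p_{\text{data}}(x) + p_{\text{model}}(x)\right)}\right] + \log\frac{1}{2} \\
     &= \text{KL}\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_{\text{model}}}{2}\right)
      + \text{KL}\left(p_{\text{model}} \,\Big\|\, \frac{p_{\text{data}} + p_{\text{model}}}{2}\right) + \log\frac{1}{4} \\
     &= 2\,\text{JSD}(p_{\text{data}} \,\|\, p_{\text{model}}) + \log\frac{1}{4}
\end{aligned}
```

The last line is where the Jensen-Shannon divergence shows up: the sum of the two KL terms is exactly twice the JSD.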

At the global minimum, we have p_{\text{data}} = p_{\text{model}}, so both KL terms vanish, hence:

C^*(G) =  \log({\frac{1}{4}})
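The result can also be checked numerically. The sketch below assumes a toy discrete space with made-up probabilities (none of this is from the paper) and computes C(G) in two ways: directly from the optimal discriminator, and through the two KL terms plus \log\frac{1}{4}. The function names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def c_direct(p_data, p_model):
    """C(G) computed by plugging the optimal discriminator into V(D, G)."""
    total = 0.0
    for p, q in zip(p_data, p_model):
        if p > 0:
            total += p * math.log(p / (p + q))
        if q > 0:
            total += q * math.log(q / (p + q))
    return total

def c_via_kl(p_data, p_model):
    """C(G) computed as KL(p_data || m) + KL(p_model || m) + log(1/4)."""
    mix = [(p + q) / 2.0 for p, q in zip(p_data, p_model)]
    return kl(p_data, mix) + kl(p_model, mix) + math.log(0.25)

# Hypothetical distributions over three outcomes (illustration only).
p_data = [0.5, 0.3, 0.2]
p_model = [0.2, 0.3, 0.5]

print(c_direct(p_data, p_model), c_via_kl(p_data, p_model))
print(c_direct(p_data, p_data), math.log(0.25))
```

The two computations agree, and when the model matches the data the value drops to \log\frac{1}{4}, which is the global minimum.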

People used to believe that minimizing the Jensen-Shannon divergence is the reason GANs can produce sharp images, while VAEs produce blurry images because they minimize the KL divergence. However, this explanation is no longer considered correct [2].

GANs have also received a lot of criticism. For instance, Schmidhuber argued that Goodfellow should have cited Schmidhuber’s 1992 paper [3]. (I found a Reddit thread on the public attack during the NIPS 2016 tutorial here: ). Interestingly, Schmidhuber was one of the reviewers of the original GAN paper, and he rejected it.

While many researchers claim that GANs generate realistic samples, a few NLP researchers do not believe so [4]. Meanwhile, standard datasets for particular tasks are necessary, because many machine learning research papers depend heavily on empirical results.


[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Goodfellow, Ian. “NIPS 2016 Tutorial: Generative Adversarial Networks.” arXiv preprint arXiv:1701.00160 (2016).