# Deep Semantic Hashing with Generative Adversarial Networks (SIGIR17)

This paper proposed a deep semantic hashing model that utilizes a generative adversarial network for learning a compact binary code for similar image search task. In semantic hashing research, the supervised models where learning a hash function from labeled images produces more accurate results than unsupervised models. The main reason is that the similarity criteria is not always based on the image contents. It depends on image annotators on how he/she interprets the given image. But obtaining labeled images are expensive, thus, it is important to develop the model that requires less labeled images but still yields a comparable result to supervised model.

Hence, the goal of this model is to generate more synthetic images in order to learn a better hash function. It leverages the GANs idea by training a conditional generator and discriminator. Intuitively, once we have a good generator that can produce an image given class labels, then we will have more image samples with known labels. GANs framework fits with this semi-supervised learning ideas.

The main model, which is DSH-GANs (Deep Semantic Hashing with GANs), has a generator and discriminator. The input real image must have labels so that we can control the generator to produce similar images (positive samples). We also need to somehow select non-relevant labels so that the generator will produce dissimilar images ( negative samples). Then, these images will go through deep CNN to learn low-level features. Finally, the feature learned from the last layer of CNN will be fed to 3 separated networks. The first one is a hash stream, which is necessary to learn a hash function. They use maximum margin as a rank loss function. The second network is classification stream. This network is to force images with similar labels to have a similar hash code. For example, two completely different images may have the same labels. Thus, this network allows this two image to end up with the same hash code. Finally, the adversary stream which is part of GANs framework is necessary because we want to adversarial train DSH-GANs.

It is important to pre-train the generator first. The author does not explain the main reason but I think we want the generator at least generate somewhat useful image samples. The pre-train might be helpful for the GANs training to converge to a good local optimal. Pre-training the generator is also based on advertorial training. We want to the generator to generate an image that preserves label information while the discriminator must learn to detect a synthetic image and be able to predict the class label. This pre-training step also utilizes unlabeled images. For any image with no labels, the author set the class label as a zero vector, so that the discriminator just needs to predict all zeros. It seems there is no good reason why choosing zeros because it means that the generator will treat unlabeled images as having labels that are outside the label sets from the corpus.

I think this model utilizes GANs idea well, especially for semi-supervised learning. The key contribution is to allow GANs to generate more image samples both positive and negative samples.

# Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is a reconstruction error. This term measures how much useful information is compressed into a learned latent vector. The common reconstruction error is based on an element-wise measurement such as a binary cross entropy for a black-and-white image or a square error between a reconstructed image and the input image.

The authors think that an element-wise measurement is not an accurate indicator of goodness of a learned latent vector. Hence, they proposed to learn a similar metric via an adversarial training. Here is how they set up the objective functions:

They use VAE to learn a latent vector of the input image. There are two loss functions in a vanilla VAE: a KL loss and a negative log data-likelihood. They replace the second loss with a new reconstruction loss function. We will talk about this new loss function in the next paragraph. Then, they have a discriminator that tries to distinguish the real input data from the generated data from the decoder of VAE. The discriminator will encourage the VAE to learn stronger encoder and decoder.

The discriminator can be decomposed into 2 parts. If it has L + 1 layers, then the first L layers is a transform function that maps the input data into a new representation. The last layer is a binary classifier. This means that if we input any input through the first L layer, we will get a new representation that is easily classified by the last layer of the discriminator. When the discriminator is very good at detecting the real input, the representation at L layer is going to be much easier to classify compared to the input data. It means that a square error between a transformed input and its transformed reconstruction input should be somewhat small when these inputs are similar.

This model has trained the same fashion as GANs; simultaneously train VAE and GANs. This idea works well for an image because the square-error is not a good metric for an image quality. This idea may work on text dataset as well because we assess the quality of the reconstructed texts based on the whole input but not collectively evaluate one word at a time.

References:

https://arxiv.org/abs/1512.09300

Variational autoencoder (VAE) requires an expressive inference network in order to learn a complex posterior distribution. The more complex inference network will result in generating high-quality data.

This work utilizes an adversarial training to learn a function $T(x,z)$ that approximates $\log q_{\phi}(z|x) - \log p(z)$. The expectation of this term w.r.t $latex q_{\phi}(z|x)$ is in fact a KL-divergence term. Since the authors prove that the optimal $T^*(x, z) = \log q_{\phi}(z|x) - \log p(z)$, the ELBO becomes:

$E_{p_{D(x)}}E_{q_{\phi}(z|x)}[-T^*(x,z) + \log p_{\theta}(x|z)]$

In order to approximate $T^*(x, z)$, the discriminator needs to learn to distinguish between a sample from a prior $p_{D(x)}p(z)$ and the current inference model $p_{D(x)}q_{\phi}(z|x)$. Thus, the objective function for the discriminator is setup as:

$\max_T E_{p_{D(x)}}E_{q_{\phi(z|x)}} \log \sigma(T(x,z)) + E_{p_{D(x)}}E_{p(z)} \log(1 - \sigma(T(x,z)))$ (1)

Taking a gradient on $T(x,z)$ w.r.t. parameter $\phi$ can be problematic because the solution of this function depends on $q_{\phi}(z|x)$. But the author shows that the expectation of gradient of $T^*(x, z)$ w.r.t $\phi$ is 0. Thus, there is no gradient and no parameter update when taking a gradient of $T^*(x,z)$.

Since $T(x,z)$ requires sample z, the parametrization trick is applied and the ELBO becomes:

$E_{p_{D(x)}}E_{\epsilon}[-T^*(x, z_{\phi}(x, \epsilon) + \log p_{\theta}(x|z_{\phi}(x, \epsilon))]$ (2)

This step is crucial because now the sampling is just a transformation from a noise and let $T^*(x, z)$ to approximate the KL-divergence term. This made this model looks like a blackbox model because we do not explicitly define a distribution $q_{\phi}(z|x)$.

This model optimizes equation (1) and (2) using adversarial training. It optimizes eq.(1) several steps in order to keep $T(x, z)$ close to optimal while jointly optimizes eq. (2).

Adaptive contrast technique is used to make $T(x, y)$ to be sufficiently close to the optimal. Basically, the KL term in ELBO is replaced by $KL(q_{\phi}(z|x), r_{\alpha}(z|x))$ where $r_{\alpha}(z|x)$ is an auxiliary distribution which could be a Gaussian distribution.

This model has a connection to variational autoencoder, adversarial autoencoder, f-GANs, and BiGANs. A new training method for VAE via adversarial training allows us to use a flexible inference that approximate a true distribution over the latent vectors.

References:

# My takes on GANs (Generative Adversarial Nets)

What is the fuss with GANs? Everybody loves GANs now. The joke I heard from my advisor is “If your current model does not work, try it aGAN !” (pun-intended).  I realize that it comes down to its adversarial objective function. Some people may think it is sexier than VAE and NADE’s objective function in which they just maximize the log-probability – the likelihood that the model will maximize the given data samples. For VAE, we will maximize the evidence lower-bound (ELBO) while NADE will maximize the conditional log probability. GANs simultaneously maximize and minimize two cost functions.

GANs’ framework has two components: data generator and discriminator. The generator will attempt to create a sample that is hopefully similar to sample from the actual data while the discriminator needs to guess if the given sample is fake or real. This process can be viewed as a competitive game where a generator tries to fool the discriminator. The game is complete when the discriminator is unable to distinguish a generated sample from a real one. On the other hand, we can also look at GANs as a cooperating game where discriminator helps the generator to create a more realistic sample by providing a feedback.

There are 2 cost functions: one for the generator and another for the discriminator. The cost function for the discriminator is a binary cross entropy:

$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}E_{x \sim p_{\text{data}}}[\log D(x)] - \frac{1}{2}E_z[\log(1 - D(G(z)))]$

There are two set of mini-batches: one from the real samples and one coming from the generator. The optimal discriminator will predict the chance of being a true sample as $\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}}$.

There are many cost functions for a generator. The original paper proposed 2 cost functions. If we construct this learning procedure as a zero-sum game, then the cost function for the generator is simply, $J^{(G)} = -J^{(D)}$. We simply want to minimize the likelihood of the discriminator for being correct.

Goodfellow [1] mentioned that the above cost function has a vanish gradient problem when the sample is likely to be fake. Hence, to overcome this, he proposed a new cost function for the generator by maximizing the likelihood of the discriminator for being wrong. The new cost function is:

$J^{(G)} = -\frac{1}{2}E_z[\log D(G(z))]$

The new cost function is more useful in practice but the former function is useful for theoretical analysis, in which we will discuss next. For now, the objective function is:

$V(D,G) = \min_G \max_D J^{(D)}(\theta^{(D)}, \theta^{(G)})$

Goodfellow shows that learning the zero-sum game is the same as minimizing the Jensen-Shannon divergence. Given that we have an optimal discriminator, the cost function is now:

$C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log D^*(x)] + E_z[\log(1 - D^*(G(z)))]$

Since we know the optimal discriminator, we now have:

$C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log (\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}})] + E_z[\log(\frac{p_{\text{model}}}{p_{\text{data}} + p_{\text{model}}})]$

Then, simplify further:

$C(G) = \text{KL}(p_{\text{data}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \text{KL}(p_{\text{model}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \log({\frac{1}{4}})$

At the global minimum, we will have $latex p_{\text{data}} = p_{\text{model}}$, hence:

$C^*(G) = \log({\frac{1}{4}})$

People believed that minimizing the Jensen-Shannon divergence is the reason why GANs can produce a sharp image while VAE produces blurry image since it minimizes KL divergence. However, this belief is no longer true [2].

However, GANs has received a lot of criticism. For one instance, Schmidhuber mentioned that Goodfellow should have cited Schmidhuber’s 1992 paper [3]. (I found Reddit’s article on the public attack during NIPS2016 tutorial here: https://www.reddit.com/r/MachineLearning/comments/5go4sa/n_whats_happening_at_nips_2016_jurgen_schmidhuber/ ). Interestingly, Schmidhuber is one of the reviewers on GANs original paper and he rejected it.

While many researchers claim that GANs generates a realistic sample, a few NLP researchers do not believe so [4]. Meanwhile,  the standard dataset for some particular tasks is necessary because many machine learning research paper is really depending on the empirical results.

References:

[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Goodfellow, Ian. “NIPS 2016 Tutorial: Generative Adversarial Networks.” arXiv preprint arXiv:1701.00160 (2016).

[3] ftp://ftp.idsia.ch/pub/juergen/factorial.pdf