Deep Semantic Hashing with Generative Adversarial Networks (SIGIR17)

This paper proposed a deep semantic hashing model that utilizes a generative adversarial network for learning a compact binary code for similar image search task. In semantic hashing research, the supervised models where learning a hash function from labeled images produces more accurate results than unsupervised models. The main reason is that the similarity criteria is not always based on the image contents. It depends on image annotators on how he/she interprets the given image. But obtaining labeled images are expensive, thus, it is important to develop the model that requires less labeled images but still yields a comparable result to supervised model.

Hence, the goal of this model is to generate more synthetic images in order to learn a better hash function. It leverages the GANs idea by training a conditional generator and discriminator. Intuitively, once we have a good generator that can produce an image given class labels, then we will have more image samples with known labels. GANs framework fits with this semi-supervised learning ideas.

The main model, which is DSH-GANs (Deep Semantic Hashing with GANs), has a generator and discriminator. The input real image must have labels so that we can control the generator to produce similar images (positive samples). We also need to somehow select non-relevant labels so that the generator will produce dissimilar images ( negative samples). Then, these images will go through deep CNN to learn low-level features. Finally, the feature learned from the last layer of CNN will be fed to 3 separated networks. The first one is a hash stream, which is necessary to learn a hash function. They use maximum margin as a rank loss function. The second network is classification stream. This network is to force images with similar labels to have a similar hash code. For example, two completely different images may have the same labels. Thus, this network allows this two image to end up with the same hash code. Finally, the adversary stream which is part of GANs framework is necessary because we want to adversarial train DSH-GANs.

It is important to pre-train the generator first. The author does not explain the main reason but I think we want the generator at least generate somewhat useful image samples. The pre-train might be helpful for the GANs training to converge to a good local optimal. Pre-training the generator is also based on advertorial training. We want to the generator to generate an image that preserves label information while the discriminator must learn to detect a synthetic image and be able to predict the class label. This pre-training step also utilizes unlabeled images. For any image with no labels, the author set the class label as a zero vector, so that the discriminator just needs to predict all zeros. It seems there is no good reason why choosing zeros because it means that the generator will treat unlabeled images as having labels that are outside the label sets from the corpus.

I think this model utilizes GANs idea well, especially for semi-supervised learning. The key contribution is to allow GANs to generate more image samples both positive and negative samples.

Paper Reading: Semantic Hashing using Tags and Topic Modeling (SIGIR13)

This paper is similar to collaborative topic modeling. They also mentioned that in their paper. The main difference is they train LDA offline so the topic distribution is not affected by tagging (supervisory signals).

From probabilistic perspective, it is variant of CTM:

Train topic model offline.

1. For each doc y_j in Corpus:
1. Draw latent offset, e ~ Gaussian(0, 1)
2. Draw document latent, y ~ e + topic(y)
2. For each tag u_i:
1. Draw tag, u_i, from Gaussian(0, 1)
3. For each doc y_j and tag u_i:
1. Draw a document label, t_{i,j} ~ Gaussian( u_i * y_j , confidence)

I think the confidence matrix that they create is simply a fixed variance.

From matrix factorization perspective:
1. They factor a tagging matrix: T = U * Y
2. Add regularizer || U ||^2 to keep U small since U is drawn from Gaussian(0,1).
3. Add regularizer that penalizes when Y is deviate from a topic(Y).

Their binarization from Y to hashcode is very straight forward.

Based on my analysis, there are a lot of rooms to improve:
1. adding non-linearity to matrix factorization, hash code function, and binarization function.
2. tagging uncertainty is learned from the data (instead of fixed variances).
3. tags (supervisory signal) can be correlated ( so document without tag t, can still have a high probability for tag t, if similar tag s appears in the doc.
4. jointly learn a topic model.
References

References:

[1] Wang, Qifan, Dan Zhang, and Luo Si. “Semantic hashing using tags and topic modeling.” Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013.

Supervised Hashing for Image Retrieval

This paper proposed a supervised hashing method that learns hash codes from the images and a hash function using convolution neural network (CNN).

Ideally, it would be nice if the model can simultaneously learn a hash function from the train images and class labels. If they only set up CNN by learning a function that maps a raw image to class labels, then they won’t learn any good representation but instead learning a discriminative model.

Instead, they need to approximate hash codes for training data from the similarity matrix. Defining a matrix $S_{i,j} = +1$ if image i is semantically similar to image j. Otherwise, $S_{i,j} = -1$. Then they cleverly assume that a dot product between two hash code (binary vector) is approximately equivalent to +q if both vectors are similar. q is a dimension of a binary vector. The objective function then becomes a matrix factorization problem:

$\min_{H} || S - \frac{1}{q}HH^T||_F^2$ where $H \in \{-1, 1 \}^{n x q}$

It is very difficult to efficiently optimize the above objective function due to an integer constraint. The authors relax this restriction by allowing a range constraint, $H \in [-1, 1 ]^{n x q}$, which is easier to optimize using a coordinate descent algorithm using Newton direction. They contribution is the derivation of efficient optimization algorithms that learns an approximate hashcode from the range constraint.

Once hash codes from train data are obtained, the next step is to determine a hash function. They construct CNN and simultaneously learns a function that maps a raw image to an image representation (encoder) and a hash function that maps this representation to the given approximate hash codes. The author called this as the CNNH model.

By adding label information as additional output layers, now CNN needs to learn an image representation that matches the given hash codes and label matrix. This additional constraint forces the CNN to learn a shared image representation. The author called this as CNNH+ model.

The CNNH+ model has the highest MAP on MNIST, CIFAR10, and NUS_WIDE datasets.

I think using deep learning architecture is trivial but learning an approximate hash code is much more challenging when the objective function is a range constraint. The authors demonstrated that the proposed optimization method is faster and more scalable than BCD algorithm. I believe BCD [2] is the state-of-the-arts method at that time.

References:

[1] Xia, Rongkai, et al. “Supervised Hashing for Image Retrieval via Image Representation Learning.” AAAI. Vol. 1. 2014.

[2] Lin, Guosheng, et al. “A general two-step approach to learning-based hashing.” Proceedings of the IEEE International Conference on Computer Vision. 2013.