Supervised Hashing for Image Retrieval (AAAI’14)

This paper proposed a supervised hashing method that learns hash codes from the images and a hash function using convolution neural network (CNN).

Ideally, it would be nice if the model can simultaneously learn a hash function from the train images and class labels. If they only set up CNN by learning a function that maps a raw image to class labels, then they won’t learn any good representation but instead learning a discriminative model.

Instead, they need to approximate hash codes for training data from the similarity matrix. Defining a matrix $S_{i,j} = +1$ if image i is semantically similar to image j. Otherwise, $S_{i,j} = -1$. Then they cleverly assume that a dot product between two hash code ( a binary vector) is approximately equivalent to +q if both vectors are similar. q is a dimension of a binary vector. The objective function then becomes a matrix factorization problem:

$\min_{H} || S - \frac{1}{q}HH^T||_F^2$ where $H \in \{-1, 1 \}^{n x q}$

It is very difficult to efficiently optimize the above objective function due to an integer constraint. The authors relax this restriction by allowing a range constraint, $H \in [-1, 1 ]^{n x q}$, which is easier to optimize using a coordinate descent algorithm using Newton direction. The contribution is the derivation of efficient optimization algorithms that learn an approximate hashcode from the range constraint.

Once hash codes from train data are obtained, the next step is to determine a hash function. They construct CNN and simultaneously learns a function that maps a raw image to an image representation (encoder) and a hash function that maps this representation to the given approximate hash codes. The author called this as the CNNH model.

By adding label information as additional output layers, now CNN needs to learn an image representation that matches the given hash codes and label matrix. This additional constraint forces the CNN to learn a shared image representation. The author called this as CNNH+ model.

The CNNH+ model has the highest MAP on MNIST, CIFAR10, and NUS_WIDE datasets.

I think using deep learning architecture is trivial but learning an approximate hash code is much more challenging when the objective function is a range constraint. The authors demonstrated that the proposed optimization method is faster and more scalable than BCD algorithm. I believe BCD [2] is the state-of-the-art method at that time.

References:

[1] Xia, Rongkai, et al. “Supervised Hashing for Image Retrieval via Image Representation Learning.” AAAI. Vol. 1. 2014.

[2] Lin, Guosheng, et al. “A general two-step approach to learning-based hashing.” Proceedings of the IEEE International Conference on Computer Vision. 2013.