Paper Reading: Semantic Hashing using Tags and Topic Modeling (SIGIR13)

This paper is similar to collaborative topic modeling. They also mentioned that in their paper. The main difference is they train LDA offline so the topic distribution is not affected by tagging (supervisory signals).

From probabilistic perspective, it is variant of CTM:

Train topic model offline.

  1. For each doc y_j in Corpus:
    1. Draw latent offset, e ~ Gaussian(0, 1)
    2. Draw document latent, y ~ e + topic(y)
  2. For each tag u_i:
    1. Draw tag, u_i, from Gaussian(0, 1)
  3. For each doc y_j and tag u_i:
    1. Draw a document label, t_{i,j} ~ Gaussian( u_i * y_j , confidence)

I think the confidence matrix that they create is simply a fixed variance.

From matrix factorization perspective:
1. They factor a tagging matrix: T = U * Y
2. Add regularizer || U ||^2 to keep U small since U is drawn from Gaussian(0,1).
3. Add regularizer that penalizes when Y is deviate from a topic(Y).

Their binarization from Y to hashcode is very straight forward.

Based on my analysis, there are a lot of rooms to improve:
1. adding non-linearity to matrix factorization, hash code function, and binarization function.
2. tagging uncertainty is learned from the data (instead of fixed variances).
3. tags (supervisory signal) can be correlated ( so document without tag t, can still have a high probability for tag t, if similar tag s appears in the doc.
4. jointly learn a topic model.


[1] Wang, Qifan, Dan Zhang, and Luo Si. “Semantic hashing using tags and topic modeling.” Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013.