# Learning a Hierarchical Embedding Model for Personalized Product Search (SIGIR’17)

This paper proposes a personalized product search model. Given a user and a query, the model must retrieve products that the user is likely to purchase; relevance is therefore defined by purchase likelihood rather than topical match alone. In other words, the goal is to find a product that meets the user’s intent or need.

The model uses product reviews written by users to model each user’s preferences. The authors assume that queries are user-independent (not personalized): observing a query alone tells us nothing about which user issued it. This assumption is necessary because the dataset contains no real queries, so the authors construct a set of pseudo-queries from the products’ descriptions. Item embeddings are likewise learned from product reviews: by reading all reviews of an item, the model should be able to summarize the item and learn its embedding vector.

The final piece of information is the query itself. The model learns word embeddings for query terms and constructs a query vector by either averaging the word embeddings, applying a non-linear projection to the averaged embedding, or running an RNN over the query to obtain a final vector. In total, the model learns word embeddings for all words observed in the pseudo-queries, plus embeddings for all users and items.

The model still needs to relate query, user, and item, and I think this is its key assumption: a linear combination of the user and query vectors should lie close to the embeddings of relevant items. In other words, knowing the user and the query should give us a good idea of which products are relevant. The objective function therefore maximizes the log-likelihood of the users’ language models, the items’ language models, and the item generation likelihood.
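
As a toy sketch of this assumption (my own simplification, not the authors’ exact formulation; the mixing weight `lam` is a hypothetical hyperparameter), items can be scored by their similarity to a linear combination of the user and query embeddings:

```python
import numpy as np

def hem_score(user_vec, query_vec, item_vecs, lam=0.5):
    """Score items against a linear combination of user and query vectors.

    lam controls personalization: lam=1 ignores the user entirely,
    lam=0 ignores the query. (This is a simplified sketch, not the
    paper's exact scoring function.)
    """
    intent = lam * query_vec + (1.0 - lam) * user_vec
    return item_vecs @ intent  # one dot-product score per item

# toy usage: rank 100 items in a 64-d embedding space
rng = np.random.default_rng(0)
u, q = rng.normal(size=64), rng.normal(size=64)
items = rng.normal(size=(100, 64))
best_item = int(np.argmax(hem_score(u, q, items)))
```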

Neural networks are used throughout the model. For constructing the query vector, projecting the averaged word embeddings yields the best accuracy. However, the model requires estimating many language models, which makes training expensive, so the authors approximate the softmax with the negative sampling technique proposed in [2]. The experimental results look sound: the model performs poorly on some product searches, but overall it beats the state of the art.
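
A minimal sketch of the negative sampling idea from [2], as I understand it: each observed pair contributes one positive term, and the expensive full-vocabulary softmax normalization is replaced with k randomly sampled negatives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(context_vec, pos_vec, neg_vecs):
    """One positive word plus k sampled negatives stand in for the
    full-vocabulary softmax normalization."""
    pos_term = np.log(sigmoid(pos_vec @ context_vec))
    neg_term = np.log(sigmoid(-(neg_vecs @ context_vec))).sum()
    return -(pos_term + neg_term)

rng = np.random.default_rng(0)
loss = neg_sampling_loss(rng.normal(size=64),
                         rng.normal(size=64),
                         rng.normal(size=(5, 64)))  # k = 5 negatives
```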

My impression is that this model is product-category specific due to the nature of the reviews: products like electronics attract a different pattern of reviews than books. This may limit scalability, because we need a separate model for each product category. Furthermore, since personalization comes from review text, the model effectively summarizes a user’s sentiment and word usage; if we mix multiple product categories, these signals would blur together and it would be hard to recover a user’s preference. This is probably the main reason the authors train a separate model per category.

References:

[1] Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and W. Bruce Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product Search. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017.

[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

# Learning to Reweight Terms with Distributed Representations (SIGIR’15)

A retrieval model computes a relevance score between a query and a document. Typically, the score is a weighted sum of per-term matching scores between term t and document D:

$s(q, D) = \sum_{t \in q \cap D} w(t)f(t, D)$

The most common weight function is the inverse document frequency (IDF), which favors rare terms over common ones. IDF is a reasonable default in most relevance models, but it is independent of the query.

This paper proposes a term-weight learning method based on distributed word vectors (Word2Vec). Motivated by the additive property of distributed representations, a query q can be summarized as the average of the vectors of all its terms. The authors believe that distributed representations can provide a more accurate estimate of a query-dependent weight function $w(t)$.

This idea is more appealing than IDF because the new weight function depends on the query: the weight of a term t is not fixed but varies with the context of the query q in which it appears.

The paper uses a simple estimate based on the assumption that the average of the word vectors summarizes the whole query; any term in the query that deviates too much from this average should be less important. For each term t in query $q_i$ with word vector $\textbf{w}_{i,j}$, the query-dependent feature vector of the term is:

$\textbf{x}_{i,j} = \textbf{w}_{i,j} - \bar{\textbf{w}}_{q_i}$

where $\bar{\textbf{w}}_{q_i}$ is the average of all word vectors in query $q_i$. To learn the mapping to true weights, the authors train a linear regression with an L1 regularizer, using $\textbf{x}_{i,j}$ as the input feature and the term recall weight [2] derived from relevance judgments as the ground truth. Once the mapping is trained, for any new query we construct the feature vector $\textbf{x}$ for each term and apply the mapping to obtain its weight.
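
A minimal sketch of this pipeline, assuming word vectors are already trained and the ground-truth term recall weights from [2] are available (the use of scikit-learn’s `Lasso` and all variable names here are my own choices, not the authors’):

```python
import numpy as np
from sklearn.linear_model import Lasso

def term_features(query_vecs):
    """x_t = w_t - mean(query): each term's deviation from the query centroid."""
    centroid = query_vecs.mean(axis=0)
    return query_vecs - centroid

# X: stacked per-term feature vectors across training queries
# y: term recall weights derived from relevance judgments (as in [2])
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))   # placeholder training features
y = rng.uniform(size=500)         # placeholder term recall weights
model = Lasso(alpha=0.01).fit(X, y)  # linear regression with L1 regularizer

# at query time: featurize the new query's terms, then predict w(t)
new_query_vecs = rng.normal(size=(4, 300))
weights = model.predict(term_features(new_query_vecs))
```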

Although the idea in this paper is simple, it shows how to exploit the additive property of distributed representations. However, the way the query-dependent feature vector is computed does not fully satisfy me. Some queries contain several mutually similar words forming one concept, such as “United States” or “San Francisco”; the query average is then biased toward the most common concept in the query. This weakens the rare term in the same query, even though that term may be the best indicator of the user’s intent.

References:

[1] Zheng, Guoqing, and Jamie Callan. “Learning to reweight terms with distributed representations.” SIGIR’15

[2] Zhao, Le, and Jamie Callan. “Term necessity prediction.” In CIKM’10.

# Learning Deep Structured Semantic Models for Web Search using Clickthrough Data (CIKM’13)

When we compute relevance scores between a query and the documents in a corpus, we want a higher score when the document is relevant to the query. Vector-space models based on exact word matching can fail when the query and a relevant document share no common words. Latent semantic models address this vocabulary gap by assigning higher probability to words that are semantically similar to the words appearing in the document.

As of 2013, when this paper was published, latent semantic models did not actually perform well compared to simple heuristic vector-space models such as BM25: exact word matching remained the most important indicator of relevance.

Nevertheless, this paper demonstrates that a deep neural network can embed both queries and documents into a shared semantic space, under the assumption that a relevant query-document pair should lie nearby in that space. Although this assumption is reasonable, the model does not emphasize exact matching, which in my opinion is very valuable information.

The deep architecture contains three main parts: a word hashing layer, mapping layers that project a query or document into the semantic space, and a relevance measurement layer that computes the cosine similarity between the given query and document. The relevance score is then produced by a softmax layer that normalizes the cosine similarities across candidate documents.

The performance gain comes from the fact that the authors use supervised information (click-through data) to train the model. By maximizing the conditional log-likelihood $\log \prod_{(Q,D^+)} P(D^+|Q)$, the model learns to map relevant query-document pairs to similar embedding vectors.
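
A sketch of that conditional probability as I read it: a softmax over smoothed cosine similarities between the query vector and the candidate document vectors (the smoothing factor `gamma` follows the paper’s formulation; everything else here is simplified).

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def p_doc_given_query(q_vec, doc_vecs, gamma=10.0):
    """Softmax over smoothed cosine similarities; index 0 is the clicked doc."""
    sims = gamma * np.array([cosine(q_vec, d) for d in doc_vecs])
    exp = np.exp(sims - sims.max())  # subtract max for numerical stability
    return exp / exp.sum()

# training minimizes -log P(D+|Q), i.e. -np.log(p_doc_given_query(q, docs)[0])
```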

The word hashing layer is introduced to reduce the dimensionality of the input vector: a vocabulary-sized one-hot vector would be far too long, while letter trigrams shrink the dimension dramatically.
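
A minimal illustration of letter-trigram word hashing: wrap the word in boundary markers and collect every letter trigram, so a huge one-hot vocabulary collapses into a much smaller trigram vocabulary.

```python
def letter_trigrams(word):
    """Map a word to its letter trigrams, e.g. 'good' -> {'#go','goo','ood','od#'}."""
    padded = f"#{word}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# a vocabulary of hundreds of thousands of words collapses to tens of
# thousands of distinct trigrams, so the input becomes a sparse bag of
# trigrams instead of a word-level one-hot vector
print(letter_trigrams("good"))
```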

An obvious extension is to replace the transformation layers with CNNs [2] or RNNs so that the model can capture local structure in the documents. [2] shows that a convolutional layer with max pooling slightly improves the NDCG score. I don’t know whether anyone has tried an RNN yet.

What I am not comfortable with is that the model treats a query as if it were a document. This assumption feels too rigid: to me, a query is a signal of a user’s information need, which is clearly not a document. I still prefer to model a query with a language model rather than a vector-space model.

References:

[1] Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM’13.

[2] Shen, Yelong, et al. “Learning semantic representations using convolutional neural networks for web search.” WWW’14

# Relevance-based Word Embedding (SIGIR’17)

Word embeddings have become a standard tool in many IR tasks. Popular tools such as Word2Vec [2] and GloVe [3] map each word to an embedding space where similar words lie nearby, with similarity defined by semantic or syntactic relatedness between words. Since similar words tend to appear in the same contexts, the objective functions exploit this assumption (e.g., the CBOW and skip-gram models). As a result, these models are very good at guessing a word given its context.

However, the authors argue that semantic and syntactic similarity is not the right metric for IR tasks such as query expansion and query classification. In query expansion, we want to suggest additional terms that are relevant to the query but not necessarily semantically or syntactically similar to it. For instance, given the query “indian american museum”, Word2Vec suggests words such as “united” and “states” because they are semantically similar to “american”, but these are poor expansion terms. A relevance-based embedding instead suggests words relevant to the whole query, such as “heye”, “chumash”, and “apa”, which are more directly related to the query. This observation motivates the authors to develop relevance-based word embeddings.

The paper proposes two models: Relevance Likelihood Maximization (RLM) and Relevance Posterior Estimation (RPE). The first model, RLM, consists of two distributions:

• $p(w|R_i)$ – the relevance model distribution for query $q_i$, where $p(w|R_i) = \sum_{d \in F_i} p(w|d) \prod_{w' \in q_i} p(w'|d)$. Essentially, this is a product of “how likely word w is under each of the top-K feedback documents” and “how similar query $q_i$ is to each of those documents”; the first term acts as a likelihood and the second as a weight.
• $\hat p(w|q_i; \theta_R)$ – the similarity between the embedded word w and query q: $\hat p(w|q_i; \theta_R) = \frac{\exp(\vec{w}^T\vec{q})}{\sum_{w'\in V}\exp(\vec{w}'^T\vec{q})}$, normalized so that the distribution sums to one.

To interpret this model: the relevance model distribution acts as a weight signifying how strongly each word is associated with a particular query, computed via a traditional PRF model (see Eq. 3 in the paper). The neural network is constrained by these weights and learns embedding vectors that maximize the resulting likelihood.
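
A sketch of the resulting objective as I read it (ignoring the hierarchical softmax speed-up discussed below): the PRF weights act as soft targets for the model’s softmax over the vocabulary, giving a weighted cross-entropy loss.

```python
import numpy as np

def rlm_loss(query_vec, word_embeds, prf_weights):
    """Cross-entropy between the PRF distribution p(w|R_i) (soft targets)
    and the model's softmax p_hat(w|q_i) over the whole vocabulary."""
    logits = word_embeds @ query_vec   # w^T q for every vocabulary word
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    return -(prf_weights * log_probs).sum()
```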

The second model, RPE, uses a mixture model, assuming the feedback set is generated by two language models: a relevance language model and a noise language model. The authors cast this as a binary classification problem: predict whether a given word w was drawn from the relevance distribution of query q. In other words, RPE estimates the following distribution:

$\hat p(R=1|\vec{w},\vec{q};\theta_R) = \frac{1}{1+\exp{(-\vec{w}^T\vec{q})}}$

When the random variable R equals one, word w is relevant to query q.

Two optimizations are needed to train the RLM and RPE models. In RLM, computing the normalization term is expensive, so the authors use a hierarchical approximation of the softmax function [4] to speed up training. Training RPE requires negative samples, for which the authors choose noise-contrastive estimation (NCE) [5].
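
A simplified sketch of the RPE training signal, using plain negative sampling in place of NCE’s noise-corrected logits: words drawn from the relevance distribution are positives, words drawn from the noise distribution are negatives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rpe_loss(query_vec, pos_word_vecs, neg_word_vecs):
    """Binary classification with p(R=1|w,q) = sigmoid(w^T q):
    maximize it for relevant words, minimize it for noise words."""
    pos = np.log(sigmoid(pos_word_vecs @ query_vec)).sum()
    neg = np.log(1.0 - sigmoid(neg_word_vecs @ query_vec)).sum()
    return -(pos + neg)
```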

The experimental results show that RLM outperforms RPE on the query expansion task. Both models beat Word2Vec and GloVe thanks to their ability to put high probability on words that are relevant to the entire query. On the query classification task, however, RPE is better than RLM.

References:

[1] Zamani, Hamed, and W. Bruce Croft. “Relevance-based Word Embedding.” SIGIR’17

[2] Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” NIPS’13.

[3] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global Vectors for Word Representation.” In EMNLP’14.

[4] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In AISTATS’05.

[5] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 2012.

# Learning to Match Using Local and Distributed Representations of Text for Web Search (WWW’17)

The idea of this paper is to combine local matching with distributed representation learning. In local matching, relevance is determined by exact matches between query terms and a candidate document. This approach is standard in traditional IR and has key advantages: it requires no training data, it scales well, and it is not domain specific. Exact matching is also important for retrieving new documents: if the matched term is unique, we can locate a new document right away.

The distributed learning approach embeds both query and document into a semantic space and computes similarity there. Traditional methods such as LSI, PLSA, and LDA learn a document’s topic distribution as a continuous vector, and the most relevant document is the one closest to the embedded query. Distributed representations matter because they capture semantic similarity where exact matching cannot.

Combining these two models is not a new idea: Wei and Croft [2] used LDA to augment a traditional retrieval model and demonstrated a performance gain. This paper combines two deep neural networks, one for local matching and one for distributed matching, by jointly learning the parameters of both.

The paper describes the model architecture, which is heavily engineered, with convolutional and pooling layers and a carefully designed input. A Hadamard product layer is used to compute exact-match interactions between the embedded document and query.
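
A sketch of what I take the local model’s exact-match input to be (my reconstruction, not the authors’ code): a binary interaction matrix between query terms and a fixed-length document prefix, which the convolutional layers then consume.

```python
import numpy as np

def exact_match_matrix(query_terms, doc_terms, max_doc_len=100):
    """X[i, j] = 1 if query term i exactly matches document term j.
    The document is truncated (or padded) to a fixed length so that the
    convolutional layers downstream always see a constant-size input."""
    doc = (list(doc_terms) + [None] * max_doc_len)[:max_doc_len]
    return np.array([[1.0 if t == d else 0.0 for d in doc]
                     for t in query_terms])

X = exact_match_matrix(["deep", "ranking"],
                       "a deep model for ranking documents".split())
```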

I feel the paper gives a good motivation for why combining local matching with distributed representations might be a good idea. However, the motivation for the specific model architecture is unclear, which makes me doubt whether the performance really comes from combining the two approaches or simply from the choice of architecture. Furthermore, the local matching component wants to capture match positions, which forces all documents to have the same length; the authors therefore keep only the first 100 or 1000 words as the document representation, discarding a lot of information.

I am also not sure why the authors chose character n-graphs as the input feature rather than simple term frequencies or raw text; the missing explanation weakens the paper. On the other hand, Figure 6, where the authors embed each model’s per-query retrieval performance vector into 2D space, is interesting: this simple visualization can be used to compare how similarly different models behave.

References:

[1] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW’17. 1291-1299.

[2] X. Wei and W. B. Croft. Lda-based document models for ad-hoc retrieval. In SIGIR’06. 178-185.

# Semantic Hashing using Tags and Topic Modeling (SIGIR’13)

This paper is similar to collaborative topic modeling (CTM), as the authors themselves note. The main difference is that they train LDA offline, so the topic distributions are not affected by the tagging (supervisory) signal.

From a probabilistic perspective, it is a variant of CTM (one draw from this generative story is sketched below). First, train the topic model offline; then:

1. For each doc $y_j$ in the corpus:
    1. Draw a latent offset, $\epsilon \sim \text{Gaussian}(0, 1)$.
    2. Draw the document latent vector, $y_j \sim \epsilon + \text{topic}(y_j)$.
2. For each tag $u_i$:
    1. Draw the tag vector, $u_i \sim \text{Gaussian}(0, 1)$.
3. For each doc $y_j$ and tag $u_i$:
    1. Draw the document label, $t_{i,j} \sim \text{Gaussian}(u_i^T y_j, \text{confidence})$.
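
A sketch of one draw from this generative story (my reading; `topic_vec` stands in for the document’s offline topic proportions, and `confidence` plays the role of the fixed noise scale):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 20  # latent dimensionality

def draw_tag_label(topic_vec, tag_vec, confidence=1.0):
    """One draw of t_{i,j}: offset the offline topic vector of doc j,
    then emit a noisy dot product with tag i's vector."""
    eps = rng.normal(0.0, 1.0, size=k)           # latent offset
    y = topic_vec + eps                          # document latent y_j
    return rng.normal(tag_vec @ y, confidence)   # tag label t_{i,j}

t = draw_tag_label(rng.dirichlet(np.ones(k)), rng.normal(0.0, 1.0, size=k))
```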

I think the confidence matrix they introduce is simply a fixed variance.

From a matrix factorization perspective:
1. They factor the tagging matrix: T = U * Y.
2. They add a regularizer || U ||^2 to keep U small, since U is drawn from Gaussian(0, 1).
3. They add a regularizer that penalizes Y when it deviates from topic(Y). (A sketch of the combined objective follows.)
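
Under this reading, the combined objective might look like the following (my own sketch; `lam_u` and `lam_y` are hypothetical regularization weights, and T is factored as U Y^T for shape consistency):

```python
import numpy as np

def mf_loss(T, U, Y, topic_Y, lam_u=0.1, lam_y=0.1):
    """||T - U Y^T||_F^2 + lam_u * ||U||^2 + lam_y * ||Y - topic(Y)||^2."""
    recon = ((T - U @ Y.T) ** 2).sum()
    return (recon
            + lam_u * (U ** 2).sum()
            + lam_y * ((Y - topic_Y) ** 2).sum())

# toy shapes: 50 tags, 200 docs, 10 latent dimensions
rng = np.random.default_rng(0)
T = rng.integers(0, 2, size=(50, 200)).astype(float)
U, Y = rng.normal(size=(50, 10)), rng.normal(size=(200, 10))
loss = mf_loss(T, U, Y, topic_Y=rng.dirichlet(np.ones(10), size=200))
```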

Their binarization from Y to hash codes is very straightforward.

Based on my analysis, there is a lot of room for improvement:
1. Add non-linearity to the matrix factorization, the hash function, and the binarization function.
2. Learn the tagging uncertainty from the data (instead of using fixed variances).
3. Model correlations among tags (supervisory signals), so that a document without tag t can still receive a high probability for t when a similar tag s appears in the document.
4. Jointly learn the topic model.

References:

[1] Wang, Qifan, Dan Zhang, and Luo Si. “Semantic hashing using tags and topic modeling.” Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013.

# Supervised Hashing for Image Retrieval via Image Representation Learning (AAAI’14)

This paper proposes a supervised hashing method that learns hash codes for training images, plus a hash function, using a convolutional neural network (CNN).

Ideally, the model would simultaneously learn a hash function from the training images and their class labels. But if the CNN were set up only to map raw images to class labels, it would learn a discriminative classifier rather than a good hash representation.

Instead, they first approximate hash codes for the training data from a similarity matrix. Define $S_{i,j} = +1$ if image i is semantically similar to image j, and $S_{i,j} = -1$ otherwise. They then make the clever observation that the dot product between two q-bit hash codes (binary vectors in $\{-1,1\}^q$) is close to +q when the codes are similar, where q is the code length. The objective function then becomes a matrix factorization problem:

$\min_{H} \| S - \frac{1}{q}HH^T\|_F^2$ where $H \in \{-1, 1 \}^{n \times q}$

It is difficult to optimize this objective efficiently because of the integer constraint. The authors relax it to a range constraint, $H \in [-1, 1]^{n \times q}$, which can be optimized with a coordinate descent algorithm using Newton directions. Their contribution is the derivation of an efficient optimization algorithm that learns approximate hash codes under this relaxed constraint.
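
As a simplified stand-in for their procedure (projected gradient descent on the relaxed objective rather than the authors’ coordinate descent with Newton directions), followed by sign thresholding to recover binary codes:

```python
import numpy as np

def learn_codes(S, q, steps=500, lr=1e-3, seed=0):
    """Minimize ||S - (1/q) H H^T||_F^2 with H in [-1, 1]^{n x q} by
    projected gradient descent, then binarize with sign()."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    H = rng.uniform(-1.0, 1.0, size=(n, q))
    for _ in range(steps):
        R = S - (H @ H.T) / q                  # residual (S is symmetric)
        grad = (-4.0 / q) * (R @ H)            # gradient of the Frobenius loss
        H = np.clip(H - lr * grad, -1.0, 1.0)  # project back into the box
    return np.sign(H)                          # binary codes in {-1, +1}

# toy similarity matrix: every item similar only to itself
codes = learn_codes(np.eye(10) * 2 - 1, q=8)
```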

Once hash codes for the training data are obtained, the next step is to learn the hash function. They build a CNN that simultaneously learns a mapping from raw images to an image representation (an encoder) and a hash function that maps this representation to the given approximate hash codes. The authors call this the CNNH model.

By adding the label information as additional output units, the CNN must learn an image representation that matches both the given hash codes and the label matrix. This additional constraint forces the CNN to learn a shared image representation. The authors call this the CNNH+ model.

The CNNH+ model achieves the highest MAP on the MNIST, CIFAR-10, and NUS-WIDE datasets.

I think the deep learning part is fairly routine; the harder problem is learning the approximate hash codes under the range-constrained objective. The authors demonstrate that their optimization method is faster and more scalable than the BCD algorithm [2], which I believe was the state-of-the-art method at the time.

References:

[1] Xia, Rongkai, et al. “Supervised Hashing for Image Retrieval via Image Representation Learning.” In AAAI’14.

[2] Lin, Guosheng, et al. “A general two-step approach to learning-based hashing.” In ICCV’13.