# Embedding-based Query Language Models (ICTIR’16)

This paper proposes a more accurate query language model for query expansion task based on word embedding.

The key contribution is their observation that the cosine similarity score between the top words with the query word is not too distant from the similarity score between the 1000th word with the same query. They use a sigmoid function to amplify the similarity score.

The rest of the paper proposes two query language models.

### The first model

The first model assumes the conditional independence between the query terms.

The model is:

$p(w|\theta_q) \propto p(\theta_q|w)p(w) = p(w)\prod_{i=1}^kp(q_i|w)$

It defines: $p(q_i|w) = \frac{\delta(q_i,w)}{\sum_{w'\in V} \delta(w',w)}$

And $p(w) = \sum_{w' \in V} p(w, w') \propto \sum_{w' \in V} \delta(w, w')$

### The second model

$p(w|\theta_q) = \sum_{w'\in V} p(w,w'|\theta_q)p(w'|\theta_q)$

This model now assumes that query and term similarity is independent:

$latex p(w,w’|\theta_q) = p(w|w’) = \frac{\delta(w,w’)}{\sum_{w”\in V} \delta(w”, w’)}$

and the second term uses MLE to estimate the probability:

$p(w'|\theta_q) = \frac{C(w'|Q)}{|Q|}$

The paper also talked about the embedded-based relevance model. The existing model by the same author is now extent with the proposed embedded query language model.

References:

[1] Zamani, Hamed, and W. Bruce Croft. “Embedding-based query language models.” Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval.

https://ciir-publications.cs.umass.edu/getpdf.php?id=1225

# Pseudo-Relevance Feedback based on Matrix Factorization (CIKM’16)

This work uses pseudo-relevance feedback collections to recommend top words and their weights to the input query.

It starts with creating a document-word matrix where each row each retrieved documents and each column is term weight. The first row is the input query’s term weights. The rest of documents either come from relevant feedback or top-K retrieved documents (pseudo-relevant feedback).

Then, we perform a non-negative matrix factorization (NMF) on the matrix. The first row of the reconstructed matrix is a re-estimate query and its top m terms will be used query expansion terms.

However, this model needs to re-compute NMF for every query, this can be expensive operations. Thus, the paper recommends creating a small document-word matrix (e.g use only top 10 retrieved documents).

What makes this model stands out is that the model not only considers the words that discriminate the top documents from the collection, it only considers the relevance of those words to the original query.

I am wondered if calculate the NMF on such a small matrix will really work. The empirical results demonstrate that this idea works but its performance is not impressive. But I think using NMF for a query expansion task is a neat idea.

References:

https://ciir-publications.cs.umass.edu/getpdf.php?id=1224

# A Dual Embedding Space Model for Document Ranking (WWW’16)

This paper uses word embedding learned from word2vec model to improve the text retrieval task.

## In and out Matrices in Word2Vec

Word2Vec boils down to learn two sets of word embedding matrices: $W_{in}$ and $W_{out}$. The objective function is to maximize the similarity between the source word $w_s \in W_{in}$ and the target word $w_t \in W_{out}$:

$w_s^Tw_t$ should be high if both words appear in the same context window.

The relationship among word vectors in $W_{in}$ is different from $W_{out}$. Each similar word vectors in $W_{in}$ have common semantic. For an instance, Cambridge and Oxford are both well-known university in the UK. On the other hands, The similar word vectors in $W_{out}$ share the common context. The word such as highlights, jersey, and stadium are all related to sport. This paper exploits this fact for the retrieval task.

The authors postulate that document’s aboutness is a key ingredient of retrieving more relevant documents.  For example, when a query word contains word ‘jersey’, the retrieved documents that contain the related words to sports context should probably be more relevant than the document that simply has synonym words such as ‘t-shirt’ or ‘cloths’. Without context information, the word such as ‘t-shirt’ can appear also in shopping context as well as sports context.

The goal is to find a relevant document that matches the information need. The vector space model such as BM25 scores a higher point to those documents with an exact keyword match. Although keyword matching is very accurate, it has a few flaws. It is not robust to the keyword stuffing problem where some spam website loaded with keywords that are irrelevant to the website. The vector space model cannot detect this problem.

Hence, the relevant score should reflect how well the query matches the context of the document. If the query keyword is ‘Cambridge’, the relevant context should contain words related to universities such as faculty, students, or campus. This intuition motivates the authors to use word embedding from IN matrix ($W_{in}$) for a query and word embedding from OUT matrix ($W_{out}$) for documents.

## Document Representation

The document representation is a centroid (an average) of all word vectors in the documents. This area could be improved because not all words represent the document equally.

## Relevant Score

Finally, to compute a relevant score between query and document, the author uses a cosine similarity between a query vector and document centroid. Furthermore, the exact matching is still important. The final model averages the score between their proposed model and BM25. This simple change yields the best NDCG score.

This paper made me re-think about the key difference between local context used in word2vec and global context used in a topic model. I hope by utilizing both information (local and global) will help us find an even more relevant documents.

References:

https://arxiv.org/pdf/1602.01137.pdf

# Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention (SIGIR’17)

This paper proposed a content-based recommendation model using attention mechanism. The task is to recommend video or image to a user based on user’s interaction with the content such as view counts or clicks. This is an implicit feedback problem.

The key motivation is that in the video or image recommendation, users do not like the entire video or images but only pay attention to the subset of the content. For the video, only some frames that users like. For an image, only some part of the images are interesting to the user. The attention mechasim

The model resembles SVD++ because the overall rating has both latent factor model and neighborhood model:

$\hat R_{ij} = (\textbf{u}_i + \sum_{l \in R(i)} \alpha(i, l)\textbf{p}_l)^T \textbf{v}_j \\ =\textbf{u}_i^T \textbf{v}_j + \sum_{l \in R(i)} \alpha(i, l)\textbf{p}_l^T \textbf{v}_j$

The first term is a standard latent factor model: finding an embedding vector for user and item. The second term is a neighborhood model: the more similar between item j and some items that user i likes, the higher rating prediction.

The difference is: SVD++ uses a uniform weight but this model learns weight from the data. Furthermore, the neighborhood model does not compute a dot product between two embedding item vectors but between item vector and auxiliary item vector. This vector is used to characterize users based on the set of items they interacted with.

The model architecture basically has 3 parts. The first part is to compute an item vector based on a weight combination of all components. E.g. find a vector representation for a video that has 100 frames. This representation $\bar x = \sum_l \beta(l, m) x_{lm}$ is a personalized item representation because the weight (attention) $\beta(l,m)$ is computed from a nonlinear function $b(i, l, m)$ that takes a user and all item components.

The second part is to compute a neighborhood item vector, $\sum_{l \in R(i)} \alpha(i,l) \textbf{p}_l$. The attention is computed from a function $a(i, l)$ that takes the user i, item l, auxiliary tiem l, $\bar x_l$. Then, this neighborhood vector will be added to a user vector i.

The final part performs a pair-wise learning. This paper use BPR (Bayesian Personalized Ranking) to minimize the rank loss between positive and negative samples. The evaluation metric use leave-one-out which is to keep the user’s last interaction with an item as the test set. Then, they measure hit rate and NDCG.

The hit rate of the proposed model is better than SVD++. I think the author should include FISM as a baseline since this model also performs well on implicit feedback task. Furthermore, SVD++ does not perform well on Top-N task since it optimized for RMSE metrics. Hence, SVD++ is not a strong baseline.

Overall, what I like about this work is the hierarchy model that construct a personalized item representation based on an attention. This idea makes a lot of sense because (1) people focus on different part of the content. The item vector should then be personalized; (2)  This model has a lot of parameters so it works well on the large video and image dataset.

References:

http://www.comp.nus.edu.sg/~xiangnan/papers/sigir17-AttentiveCF.pdf

# Neural Collaborative Filtering (WWW’17)

This is another paper that applies deep neural network for collaborative filtering problem. Although there are a few outstanding deep learning models for CF problems such as CF-NADE and AutoRec, the author claims that those models are solving for explicit feedback and positioned this work to solve for ‘implicit feedback CF’ problem.

The model is straightforward and similar to Neural Factorization Machine. The idea is to find embedding vectors for users and items and model their interaction as a non-linear function via a multi-layers neural network. A non-linear interaction has been commonly used in many recent works such as Neural Factorization Machine [2], Deep Relevant Matching Model [3].

The authors proposed 3 models incrementally: (1) Generalized Matrix Factorization – which is basically MF with additional non-linear transformation. In this case, they use sigmoid function; (2) Multi-layer Perceptron – which concatenate user and item embedded vectors and transform them by a learned non-linear function; (3) Fusion model can be either shared the same embedding vectors and add them up at the last layers or learned separate embedding vectors and concatenate them at the output layer.

Since they want to predict either the given item is preferable or not, it is a binary classification problem. Then the loss function can be a binary cross-entropy. To me, implicit feedback seems to be an easier problem than rating prediction problem because we only need to make a binary prediction.

The baseline seems okay but I wish the authors include more recent deep learning models such as [4]. The AutoRec model is also applicable for an implicit feedback by forcing the output to be a binary output. Regardless of their baseline, the extensive experiments tried to convince readers that deep neural network can model a complex interaction and will improve the performance.

Speak of the performance, the author uses hit-ratio and NDCG. Basically, there is one test sample for each user. The model tries to give a high score for that sample. This metric is more practical than MSE and RMSE since the dataset is implicit feedback dataset.

Non-linear interaction is a simple extension to a traditional MF model. This author shows that this simple extension does work for MovieLens and Pinterest dataset.

Reference:

[1] He, Xiangnan, et al. “Neural collaborative filtering.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

http://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

[2] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.

[3] Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.

[4] Zheng, Yin, et al. “Neural Autoregressive Collaborative Filtering for Implicit Feedback.” Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016. APA

# IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models (SIGIR’17)

This paper uses GANs framework to combine generative and discriminative information retrieval model. It shows a promising result on web search, item recommendations, and Q/A tasks.

Typically, many relevant models are classified into 2 types:

• Generative retrieval model
• It generates a document given a query and possibly relevant score. The model is $p(d|q, r)$.
• Discriminative retrieval model
• It computes a relevance score for the given query and document pair. The model is $p(r| d, q)$.

The generative model tried to find a connection between document and query. On the other hand, the discriminative model attempts to model the interaction between query and document based on relevance scores.

Both models have their shortcoming. Many generative models require a predefined data generating story. The wrong assumption will lead to the poor performance. The generative model is usually trying to fit the data to its model without external guidance. Meanwhile, the discriminative model requires a lot of labeled data to be effective, especially for a deep neural network model.

By train both models using GANs framework, it is now possible to solve their shortcoming. The generative model is now adaptive because the discriminator will reward the generative model when it can create or select good samples. This adaptive guidance from the discriminator is unique in GANs framework and will help the generator learns to pick good samples from the data distribution. At the same time, the discriminator can receive even more training data from the generative model. This is similar to semi-supervised learning where unlabeled data are utilized. Adversarial training allows us to improve both generative and discriminative models via jointly learning through the

Adversarial training allows us to improve both generative and discriminative models via jointly learning through the minimax training. The traditional training based on maximum likelihood does not have principle way to allow both models to give each other feedbacks.

The paper seems to be promising and their results on 3 information retrieval tasks are really good. But I notice that their training procedure requires pretraining. This made me wonder if pre-training is part of the performance boost during testing. I don’t find the part in the paper that explains the benefit of pretraining in their settings.

The discriminative model is straight forward. It is a sigmoid function. The discriminator basically gives a high probability when the given document-query pair is relevant. The generative model is more interesting. In the standard GANs, the generator will create a sample from a simple distribution, but IRGAN does not generate a new document-query pair. Instead, the author chose to let the generator select the sample from the document pool. In my opinion, this approach is simpler than creating a new data because the sample is realistic. Also, IRGAN cares about finding a function to compute a relevance score so it is unnecessary to generate a completely new data.

However, the cost function for the generator is an expectation over all documents in the corpus. The Monte Carlo approximation will have a high variance. Thus, they use policy gradient to reduce the variance so that the model can learn a useful representation. Although $p(d|q,r)$ is a discrete distribution,  the backprop is applicable because we pre-sample all documents from $p(d|q,r)$ beforehand. Thus, eq. 5 is differentiable. The extra care may need in order to reduce variance further. They use an advantage function. (Please look at the reference on Reinforcement Learning [2]).

Generating positive and negative samples are still confusing in this paper. It seems to be application specific. The author mentioned about using softmax with temperature hyper-parameter to put more or less focus on top documents. My guess is when we put less focus on top documents, the generator has more chance to pick up more negative samples. After I read the paper again, it seems that all samples selected by the generator model are negative samples. This part remains unclear and I need to ask the author for more details.

In conclusion, I like this paper because it tried to combine generative and discriminative retrieval models via GANs framework. I would not be appreciated if they simply applied the IR task to GANs blindly. Instead, they explained their motivation and discussed the advantage of jointly train both models. It seems adversarial training is useful for IR tasks as well.

References:

[1] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, and others. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS.

[3] IRGAN code (http://lantaoyu.com/publications/IRGAN)

# Learning a Hierarchical Embedding Model for Personalized Product Search (SIGIR’17)

This paper proposed a personalized product search model. Given a user and query, the model needs to retrieve a relevant product that a user will likely to purchase. Due to this requirement, the relevancy is based on the likelihood of the user will buy retrieved products. This can be finding a product that meets a user’s intent or need.

The model uses product reviews written by users to model a user’s preference. The authors assume that a query is independent (not personalized). It means by observing a query alone, we can’t tell which user creates that query. This assumption is necessary because the dataset does not have queries and the authors need to construct a set of queries based on the product’s description. On the other hands, The item embedding is also based on its product reviews. By looking all reviews, the model should be able to summarize the item and be able to learn its embedding vector.

The last information is a query. The model also learns word embedding for queries and then construct a query based on either averaging all words, non-linear projecting word embedding or simply use RNN to get a final embedding vector. It means that this model will learn word embedding for all words observed in the pseudo-queries, word embedding for all users and items.

But they need to relate query, user, and item. And I think this is the key assumption of this model. They assume that a linear combination of user and query should be similar to a relevant item. Hence, the model assumes knowing user and query, we should have a good idea what should be relevant products/items. As a result, the objective function is to maximize the log likelihood of the language model of all users, the language model of all items, and the item generation likelihood.

The neural networks are utilized in their model. To construct a query vector, projecting an average word embedding yields the best accuracy. However, this model requires approximating a lot of language models. This could be challenging to train. Thus, the author approximates the softmax calculation based on negative sampling technique proposed by [2]. The experiment results seem valid but this model performs poorly on some product searches. But this model beats the state-of-the-arts.

My impression of this model is that it seems to be product specific due to nature of the given reviews. Some products like electronics have a different pattern of reviews than books. This may limit the scalability of this model because we need separate models for each product categories. Furthermore, I think to personalize a user, their model uses review texts. It means that the model could summarize user’s sentiment or word usage. If we mix multiple product categories, the sentiment might mix up and it will be difficult to tell a user preference. And this is probably the main reason why the author separate different product categories and train separate models.

References:

[1] Qingyao Ai, Youngfeng Zhang, Keping Bi, Xu Chen, W. Bruce Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product Search. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Tomax Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arxiv prepreint arXiv:1301.3781.