# Node2Vec: Scalable feature learning for networks (KDD’16)

## Intro

This paper proposes a generalization of DeepWalk [2] and Line [3].

## Key Ideas

The key observation is that the role of a vertex matters as well as its community. For example, two vertices that are far apart in the graph but have similar kinds of neighbors play similar roles (e.g. hubs or bridges). Breadth-first search (BFS) traversal can capture this structural similarity. On the other hand, a community is described by the reachability/closeness of two nodes in the graph. For instance, in a social network, my friend’s friend’s friend has a higher chance of belonging to the same community as me.

## Similar Works

DeepWalk uses a random walk to create a set of vertices that represents the context around the vertex of interest. The path generated by the random walk may lead to far-away nodes or stay near the seed. The context captured by this process can therefore be unpredictable and depends on the graph structure.

LINE focuses on neighbor vertices, which is similar to a breadth-first search (BFS) traversal. In this case, it captures the local community in the graph.

## Node2Vec

It comes down to what is a proper way to define the walk so that we can capture both community structure and role-based structure. Node2Vec defines a general method of graph traversal that is controlled by 2 parameters, p and q.

The key difference between BFS and DFS sampling is that BFS is better at exploring local neighborhoods while DFS is good at exploring larger parts of the network. BFS gives a micro-view of the graph whereas DFS characterizes the macro-view. The authors believe that a mixture of these two classic sampling strategies will improve the learned graph embedding.

## Search bias $\alpha$

The search bias applied to the unnormalized transition probability from the current node v to a candidate node x is:

$\alpha_{p,q}(t, x) = \frac{1}{p}$ if $d_{tx} = 0$

$\alpha_{p,q}(t, x) = 1$ if $d_{tx} = 1$

$\alpha_{p,q}(t, x) = \frac{1}{q}$ if $d_{tx} = 2$

Here t is the previously visited node, v is the current node, and x is the candidate next node. The distance $d_{tx}$ is the shortest-path distance between t and x, and it determines the type of move. When the distance between the previous node t and the next node x is zero, the walk returns to node t.

When the distance is 1, x is directly connected to both the current node v and the previous node t, i.e. x is a shared neighbor of v and t. Hence, this step stays close to t and captures the local structure of the network (local view).

Lastly, when the distance is 2, the walk hops further away from node t. This is similar to DFS, where we want to go deeper into the graph.

The parameters p and q control the character of the biased walk. A high value of p means that the walk rarely steps back to the previous node. A high value of q means that the walk avoids hopping away and stays local. Hence, p and q control the balance between BFS-like and DFS-like sampling.
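The three cases of the search bias can be sketched in a few lines of Python. This is only a minimal illustration, assuming an unweighted, undirected graph stored as a dictionary of neighbor sets; the names `search_bias`, `biased_walk`, and the toy `graph` are hypothetical and not taken from the paper's code.

```python
import random

def search_bias(graph, t, x, p, q):
    """Return alpha_{p,q}(t, x) for a walk that moved t -> v and now considers x."""
    if x == t:             # d_tx = 0: step back to the previous node
        return 1.0 / p
    if x in graph[t]:      # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q         # d_tx = 2: x moves away from t

def biased_walk(graph, start, length, p, q, rng=random):
    """Generate one biased walk with at most `length` vertices."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = sorted(graph[v])
        if not neighbors:
            break
        if len(walk) == 1:
            # No previous node yet, so the first step is uniform.
            walk.append(rng.choice(neighbors))
            continue
        t = walk[-2]
        weights = [search_bias(graph, t, x, p, q) for x in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Toy graph (assumption: made-up data for illustration only).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
walk = biased_walk(graph, "a", length=6, p=4.0, q=0.5, rng=random.Random(0))
```

With p = 4 and q = 0.5 the walk is discouraged from backtracking and encouraged to move outward, which is the DFS-like setting described above. On a weighted graph the bias would additionally be multiplied by the edge weight before normalizing.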

## Closing

This paper generalizes the random walk by adding parameters that control the walk's characteristics. I think this is a neat idea because some information networks may need a more specific walk than a general random walk. Thus, this model allows us to define the context based on how much we want to explore the network versus how much we want to exploit the local structure.

## References:

[1] Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016.

https://arxiv.org/pdf/1607.00653.pdf

[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[3] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.

# LINE: Large-scale Information Network Embedding (WWW’15)

## Introduction

This paper [1] proposes a method for embedding a large graph into a low-dimensional space. In contrast to DeepWalk [2], LINE utilizes both first-order proximity (a direct edge) and second-order proximity (shared neighbors).

## Contribution

The contribution is to use second-order proximity to preserve the global structure of the graph. Another benefit of this approach is that it increases the number of training samples, because an information network (graph) can be sparse (a small number of edges).

Another key contribution is their optimization procedure. An edge sampling method stabilizes training with stochastic gradient descent. Without this method, the gradients can explode and have a high variance, which degrades the performance.

## Second-order Proximity

The key observation is that first-order proximity alone is not sufficient to preserve the network structure because only a small number of links/connections between vertices are observed. Hence, looking at the neighbor vertices can be helpful. The second-order proximity between vertices u and v is defined as the similarity between the neighborhoods of u and v.

## First-order Model

The objective function that preserves the first-order proximity encourages the model to find embeddings for vertices $v_i$ and $v_j$ such that their embeddings $u_i$ and $u_j$ are similar whenever the two vertices are connected. The model attempts to maximize the following joint probability:

$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\textbf{u}_i^T\textbf{u}_j)}$

Then, they want to make sure that the joint probability $p_1$ is close to the empirical probability $\hat p_1(i, j) = \frac{w_{ij}}{W}$ where $W = \sum_{(i,j) \in E} w_{ij}$.

The objective function is:

$O_1 = - \sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$
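As a rough illustration of $O_1$, the NumPy sketch below evaluates the objective for a toy graph. The function name `first_order_loss`, the array layout (one row of `U` per vertex), and the toy data are assumptions made for this example only.

```python
import numpy as np

def first_order_loss(U, edges, weights):
    """O_1 = -sum over edges (i, j) of w_ij * log sigmoid(u_i . u_j)."""
    i, j = edges[:, 0], edges[:, 1]
    scores = np.sum(U[i] * U[j], axis=1)      # u_i^T u_j for every edge
    log_p1 = -np.logaddexp(0.0, -scores)      # log(1 / (1 + exp(-s))), computed stably
    return -np.sum(weights * log_p1)

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 8))        # 4 vertices, 8-dim embeddings (toy sizes)
edges = np.array([[0, 1], [1, 2], [2, 3]])
weights = np.array([1.0, 2.0, 1.0])
loss = first_order_loss(U, edges, weights)
```

Pushing the embeddings of connected vertices toward each other increases $\textbf{u}_i^T\textbf{u}_j$ and therefore lowers this loss.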

## Second-order Model

This model assumes that if two vertices share many common neighbors, then they should be similar. The set of shared vertices is treated as a context. The authors introduce a context embedding: each vertex now has an additional embedding vector, $u'_i$, that is used when the vertex acts as a context. I think the real motivation is to force vertex embeddings that share a similar context to become closer. It implicitly forces similar embedding vectors between two similar vertices.

To measure the similarity between vertex $v_i$ and its context vertex $v_j$, we define the conditional distribution over the given vertex $v_i$ as:

$p_2(v_j|v_i) = \frac{\exp({\textbf{u}'_j}^T\textbf{u}_i)}{\sum_{k=1}^{|V|} \exp({\textbf{u}'_k}^T\textbf{u}_i)}$

Then, they also want the conditional distribution to be similar to the empirical distribution $\hat p_2(v_j|v_i) = \frac{w_{ij}}{d_i}$, where $w_{ij}$ is the weight of edge (i, j) and $d_i$ is the out-degree of vertex i.

The objective function is defined as:

$O_2 = - \sum_{(i, j) \in E} w_{ij} \log p_2(v_j|v_i)$
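The second-order objective can be sketched the same way. The code below is a hedged illustration that evaluates the full softmax over all vertices, which is only practical for small graphs; `second_order_loss`, `U_ctx` (the context embeddings $u'$), and the toy data are hypothetical names chosen for this example.

```python
import numpy as np

def second_order_loss(U, U_ctx, edges, weights):
    """O_2 = -sum over edges (i, j) of w_ij * log p_2(v_j | v_i)."""
    i, j = edges[:, 0], edges[:, 1]
    logits = U[i] @ U_ctx.T                           # shape (num_edges, |V|)
    logits -= logits.max(axis=1, keepdims=True)       # subtract max for stability
    log_p2 = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.sum(weights * log_p2[np.arange(len(edges)), j])

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 8))                # vertex embeddings u_i
U_ctx = rng.normal(scale=0.1, size=(4, 8))            # context embeddings u'_i
edges = np.array([[0, 1], [1, 2], [2, 3]])
weights = np.array([1.0, 2.0, 1.0])
loss = second_order_loss(U, U_ctx, edges, weights)
```

The sum over all $|V|$ vertices in the denominator is what makes this expensive on large graphs, so a practical implementation would need some approximation of the softmax.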

## Combining Model

The authors simply concatenate the embedding vectors learned from the first-order model and the second-order model. Jointly training the two objective functions is left for future work.

## Edge Sampling

The experimental results show that a straightforward optimization using SGD suffers from the high variance problem. Thus, edge sampling is an important strategy to get the best performance.

The main problem is that each edge has a different weight. Because the gradient is scaled by the edge weight, widely varying weights make it hard to choose a single learning rate. To work with binary weights instead, they sample edges according to their weights. The sampling strategy is simple: the probability of choosing edge (i, j) is the ratio of its weight $w_{ij}$ to the sum of all weights. Since sampling naively from this distribution is computationally expensive, they use the alias table method [3] to speed up the sampling.
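To make the alias table idea concrete, here is a small sketch of the standard alias method for sampling an index in proportion to its weight. It is an illustration under assumptions, not the authors' implementation; `build_alias_table`, `sample_edge`, and the toy weights are hypothetical names.

```python
import random

def build_alias_table(weights):
    """Build alias tables so that index k is drawn with probability weights[k] / sum(weights)."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]    # average bucket mass becomes 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [k for k, s in enumerate(scaled) if s < 1.0]
    large = [k for k, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]                      # bucket s keeps its own mass ...
        alias[s] = l                             # ... and is topped up by index l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for k in small + large:                      # leftovers equal 1.0 up to rounding
        prob[k] = 1.0
    return prob, alias

def sample_edge(prob, alias, rng=random):
    """Draw one edge index in O(1): pick a bucket, then keep it or take its alias."""
    k = rng.randrange(len(prob))
    return k if rng.random() < prob[k] else alias[k]

edge_weights = [1.0, 2.0, 3.0, 4.0]              # toy weights (made-up data)
prob, alias = build_alias_table(edge_weights)
edge_index = sample_edge(prob, alias, random.Random(0))
```

Building the table takes time linear in the number of edges, after which every draw costs constant time. Each sampled edge can then be treated as a binary edge in the SGD update.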

## References:

[1] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.

[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

My post on DeepWalk: DeepWalk: Online Learning of Social Representation (KDD’14)

[3] Li, Aaron Q., et al. “Reducing the sampling complexity of topic models.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

# DeepWalk: Online Learning of Social Representation (KDD’14)

DeepWalk is a novel approach for learning a latent representation of vertices in a network. The problem can be summarized as follows: given a graph of vertices and edges, we want to embed each vertex into an embedding space.

## How does DeepWalk work?

The idea behind this model is that vertices that are near each other should have similar latent vectors. So it really comes down to how to define a set of similar vertices. Interestingly, there is a connection between graph and natural language. The authors show that the distribution of words in natural language, as well as distribution of vertices appearing in short random walks, follow a power-law.

This suggests that a random walk starting at a random vertex can be modeled in the same way as a sequence of symbols (words). Hence, we can use a language model to estimate the likelihood of a vertex appearing in a given random walk.

Basically, DeepWalk model wants to learn a mapping function $\Phi: v \in V \rightarrow \mathcal{R}^{|V|\times d}$ that will maximize the following likelihood function:

$P(v_i | \Phi(v_1), \Phi(v_2), \cdots, \Phi(v_{i-1}))$

But the above likelihood function becomes more expensive to compute as the length of the walk grows. Hence, they use relaxations similar to those in word2vec, including ignoring the word order, using a fixed window size, and using the skip-gram formulation.

Now, the likelihood has changed to:

$P(v_{i-w}, \cdots, v_{i-1}, v_{i+1}, \cdots, v_{i+w}|\Phi(v_i)) \quad (1)$

This model is the same as the Skip-gram model in word2vec. The authors use a hierarchical softmax to model equation 1.

To train the model, the authors will go through each vertex in the graph, perform a random walk of length t, then optimize the objective function.
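The data-generation side of this training loop can be sketched as follows. This is a minimal illustration that only produces the (center, context) pairs a Skip-gram model would be trained on; the optimization step is omitted, and `random_walk`, `skipgram_pairs`, and the toy graph are hypothetical names.

```python
import random

def random_walk(graph, start, t, rng=random):
    """Uniform random walk that visits at most t vertices, starting from `start`."""
    walk = [start]
    while len(walk) < t:
        neighbors = sorted(graph[walk[-1]])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def skipgram_pairs(walk, w):
    """Emit (center, context) pairs for every vertex within window size w."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - w), min(len(walk), i + w + 1)
        pairs.extend((center, walk[k]) for k in range(lo, hi) if k != i)
    return pairs

# Toy graph as adjacency lists (made-up data).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
vertices = list(graph)
rng.shuffle(vertices)                       # visit the vertices in random order
corpus = [random_walk(graph, v, t=5, rng=rng) for v in vertices]
pairs = [pair for walk in corpus for pair in skipgram_pairs(walk, w=2)]
```

Each walk plays the role of a sentence and each vertex the role of a word, so the resulting pairs can be fed to any Skip-gram trainer.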

This is an interesting paper and shows the connection between graph structure and word embedding via the local context.

## References:

[1] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

https://arxiv.org/pdf/1403.6652.pdf

# Document Neural Autoregressive Distribution Estimation

This work extends Neural Autoregressive Distribution Estimation (NADE) to document modeling.

The key idea of NADE is that each hidden vector and each output is conditioned on the previously seen inputs, so the joint probability factorizes into a product of conditionals:

$p(\textbf{v}) = \prod_{i=1}^D p(v_i | \textbf{v}_{<i})$

$\textbf{h}(\textbf{v}_{<i}) = g(\textbf{c} + W_{:,<i}\textbf{v}_{<i})$

Then, the probability of the output is:

$p(v_i=1|\textbf{v}_{<i}) = \text{sigm}(b_i + V_{i,:}\textbf{h}(\textbf{v}_{<i}))$

NADE has a set of separate hidden layers, each of which represents the previously seen context. However, NADE is not applicable to variable-length input such as a sequence of words.

DocNADE tackles the variable-length input issue by computing the hidden vector as follows:

$\textbf{h}(\textbf{v}_{<i}) = g(\textbf{c} + \sum_{k<i} W_{:,v_k})$

Each word $v_k$ is an index into a fixed-size vocabulary. Each column of matrix W is a word embedding. Hence, a summation of word vectors represents the context of the previous words. This does not preserve the word order since the model simply sums all word vectors.
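The summation over previous word vectors can be computed for all positions at once with a prefix sum. The NumPy sketch below assumes a sigmoid activation for g and a layout where W has one column per vocabulary word; `docnade_hidden` and the toy sizes are hypothetical names chosen for illustration.

```python
import numpy as np

def docnade_hidden(W, c, words):
    """Hidden vector at every position i: g(c + sum of W[:, v_k] over k < i)."""
    emb = W[:, words].T                      # (D, H): one embedding per document word
    prefix = np.cumsum(emb, axis=0) - emb    # exclusive prefix sum: row i covers words before i
    return 1.0 / (1.0 + np.exp(-(c + prefix)))   # g is assumed to be a sigmoid here

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))                  # hidden size 4, vocabulary size 6 (toy sizes)
c = np.zeros(4)
h = docnade_hidden(W, c, [2, 0, 1, 3])       # a document of 4 word indices
```

Because the hidden vector at a position is built from a plain sum, shuffling the earlier words leaves it unchanged, which is the loss of word order mentioned above.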

The output layer requires a softmax function to compute the word probability. A hierarchical softmax is necessary to scale up this calculation.

The previous model may not be suitable as a language model because it focuses on learning a semantic representation of the document. The hidden layer now needs to pay more attention to the most recent terms. This can be accomplished by adding an n-gram model:

$\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{c} + \sum_{k<i} W_{:,v_k} + \textbf{h}_i^{LM}(\textbf{v}_{<i}))$

The additional hidden unit $\textbf{h}_i^{LM}$ models an n-gram language model:

$\textbf{h}_i^{LM}(\textbf{v}_{<i}) = \sum_{k=1}^{n-1}U_k \cdot W_{:,v_{i-k}}^{LM}$

The matrix $W^{LM}$ is a word embedding based on n-gram model.

## Summary

DocNADE is similar to a Recurrent Neural Network in that both models estimate the conditional probability of the current input given the previous inputs. For the language modeling task, an RNN is less explicit about how many words or how much context to look back on, whereas DocNADE requires us to explicitly tell the model the number of words to look back. On the other hand, DocNADE has a similar flavor to Word2Vec in that the document representation is simply an aggregate of all previously seen words. However, DocNADE adds an additional transformation on top of the hidden units.

Will this type of autoregressive model fall out of fashion due to the success of recurrent networks with attention mechanisms and memory models? The current trend suggests that RNNs are more flexible and extensible than NADE. Hence, we will likely see more development and extension of RNN models in the coming year.

References:

Lauly, Stanislas, et al. “Document neural autoregressive distribution estimation.” arXiv preprint arXiv:1603.05962 (2016).

# A Dual Embedding Space Model for Document Ranking (WWW’16)

This paper uses word embedding learned from word2vec model to improve the text retrieval task.

## In and out Matrices in Word2Vec

Word2Vec boils down to learning two word embedding matrices: $W_{in}$ and $W_{out}$. The objective function maximizes the similarity between the source word $w_s \in W_{in}$ and the target word $w_t \in W_{out}$:

$w_s^Tw_t$ should be high if both words appear in the same context window.

The relationship among word vectors in $W_{in}$ is different from that in $W_{out}$. Similar word vectors in $W_{in}$ share common semantics. For instance, Cambridge and Oxford are both well-known universities in the UK. On the other hand, similar word vectors in $W_{out}$ share a common context. Words such as highlights, jersey, and stadium are all related to sports. This paper exploits this fact for the retrieval task.

The authors postulate that a document’s aboutness is a key ingredient for retrieving more relevant documents. For example, when a query contains the word ‘jersey’, retrieved documents that contain words related to the sports context should probably be more relevant than documents that simply contain near-synonyms such as ‘t-shirt’ or ‘clothes’. Without context information, a word such as ‘t-shirt’ can appear in a shopping context as well as a sports context.

The goal is to find relevant documents that match the information need. A term-matching model such as BM25 gives a higher score to documents with an exact keyword match. Although keyword matching is very accurate, it has a few flaws. It is not robust to keyword stuffing, where a spam website is loaded with keywords that are irrelevant to its content. Term matching alone cannot detect this problem.

Hence, the relevance score should reflect how well the query matches the context of the document. If the query keyword is ‘Cambridge’, the relevant context should contain words related to universities such as faculty, students, or campus. This intuition motivates the authors to use word embeddings from the IN matrix ($W_{in}$) for the query and word embeddings from the OUT matrix ($W_{out}$) for documents.

## Document Representation

The document representation is a centroid (an average) of all word vectors in the documents. This area could be improved because not all words represent the document equally.

## Relevant Score

Finally, to compute a relevance score between a query and a document, the authors use the cosine similarity between the query vector and the document centroid. Furthermore, exact matching is still important, so the final model averages the score of their proposed model with the BM25 score. This simple change yields the best NDCG score.
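The scoring step can be sketched as below. This is an illustration under several assumptions: embeddings are stored one row per word, word vectors are length-normalized before averaging, the cosine similarity is averaged over the query words, and the BM25 score is supplied from elsewhere. The names `desm_score`, `mixed_score`, and `alpha` are hypothetical.

```python
import numpy as np

def unit(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def desm_score(query_ids, doc_ids, W_in, W_out):
    """Average cosine similarity between IN query vectors and the OUT document centroid."""
    centroid = unit(W_out[doc_ids]).mean(axis=0)   # document centroid from OUT vectors
    q = unit(W_in[query_ids])                      # query words from IN vectors
    return float(np.mean(q @ unit(centroid)))

def mixed_score(desm, bm25, alpha=0.5):
    """Blend the embedding score with a lexical score; alpha=0.5 is a plain average."""
    return alpha * desm + (1.0 - alpha) * bm25

rng = np.random.default_rng(0)
W_in = rng.normal(size=(10, 5))                    # toy vocabulary of 10 words, 5 dims
W_out = rng.normal(size=(10, 5))
score = desm_score([1, 4], [2, 3, 4, 7], W_in, W_out)
final = mixed_score(score, bm25=0.3)               # the BM25 value here is made up
```

Since cosine similarity and BM25 live on different scales, a real system would likely normalize or tune the mixing weight rather than use a fixed average.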

This paper made me rethink the key difference between the local context used in word2vec and the global context used in a topic model. I hope that utilizing both kinds of information (local and global) will help us find even more relevant documents.

References:

https://arxiv.org/pdf/1602.01137.pdf

# LDA2Vec

This paper proposes an extension of word2vec that adds document and topic vectors inspired by LDA (Latent Dirichlet Allocation). The model can discover linear relationships between words. For example, “California + technology = Silicon Valley”. A topic is interpreted by collecting the word vectors nearest to the selected topic vector. This work boosts word2vec with topic modeling while training in a similar fashion to word2vec.

The key difference of LDA2Vec is its loss function. There are 2 loss functions. The first one is the Skip-gram Negative Sampling loss, which is similar to Word2Vec. Given a context vector $\vec c_j$, it rewards the model for assigning a high probability to the observed target word $\vec w_j$ and a low probability to randomly drawn non-target words (negative samples). In other words, this loss wants the model to distinguish a positive word (one related to the given context) from negatively sampled words.

The innovation is the context vector $\vec c_j$. The intuition is that predicting a nearby word given a pivot word also depends on the theme of the document. For example, if the document is about an airline and we want to predict nearby words given the word “Germany”, we will likely want to see words related to airlines rather than country names. Thus, the context vector is the sum of a word vector and a document vector, $\vec c_j = \vec w_j + \vec d_j$. In this way, LDA2vec attempts to capture both the document-wide relationship and the local interaction between words within the context window.

In order to learn topic vectors, the document vector is further decomposed as a linear combination of topic vectors, $\vec d_j = \sum_{k} p_{jk} \cdot \vec t_k$, where $p_{jk}$ is the probability that document j belongs to topic k. Finally, the interpretability comes from the sparsity of the topic assignment vector $p_j$. This is the job of the second loss function, which is designed as:

$L^{d} = \lambda \sum_{jk} (\alpha - 1)\log p_{jk}$

When $\alpha < 1$, we encourage a topic assignment probability to put more mass on a small set of topics.
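The context vector and the sparsity term can be sketched in NumPy as follows. Two assumptions are made here: the topic proportions $p_j$ are obtained by applying a softmax to unconstrained document weights, and the sparsity term is read as a Dirichlet log-likelihood, so a larger value is better and a trainer would minimize its negative. The names `context_vector` and `dirichlet_term` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def context_vector(word_vec, doc_logits, topics):
    """c_j = w_j + d_j, where d_j = sum_k p_jk * t_k."""
    p = softmax(doc_logits)          # topic proportions p_jk (softmax is an assumption)
    doc_vec = p @ topics             # mix topic vectors, one row of `topics` per topic
    return word_vec + doc_vec, p

def dirichlet_term(P, alpha, lam):
    """lam * sum_jk (alpha - 1) * log p_jk, the expression shown above."""
    return lam * np.sum((alpha - 1.0) * np.log(P))

topics = np.array([[1.0, 0.0], [0.0, 1.0]])          # 2 toy topic vectors
c_j, p_j = context_vector(np.zeros(2), np.array([0.0, 0.0]), topics)

sparse = np.array([[0.98, 0.01, 0.01]])              # most mass on one topic
uniform = np.full((1, 3), 1.0 / 3.0)                 # mass spread evenly
sparse_value = dirichlet_term(sparse, alpha=0.5, lam=1.0)
uniform_value = dirichlet_term(uniform, alpha=0.5, lam=1.0)
```

With $\alpha = 0.5$ the sparse assignment gets the larger value, because $(\alpha - 1)$ and $\log p_{jk}$ are both negative. This matches the intuition that $\alpha < 1$ favors putting mass on few topics, provided the term is maximized.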

The results look interesting. This paper shows a simple way to combine topic modeling with word embedding. By embedding document vectors and topic vectors into the same semantic space as word vectors, we can learn a global semantic structure as well as word-level local interaction.

References:

[3] Skip-gram tutorial part1: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[4] Skip-gram tutorial part2: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

# Learning a Hierarchical Embedding Model for Personalized Product Search (SIGIR’17)

This paper proposes a personalized product search model. Given a user and a query, the model needs to retrieve relevant products that the user is likely to purchase. Because of this requirement, relevance is based on the likelihood that the user will buy the retrieved products. In other words, the task is to find products that meet the user’s intent or need.

The model uses product reviews written by users to model a user’s preference. The authors assume that a query is independent of the user (not personalized). It means that by observing a query alone, we can’t tell which user created that query. This assumption is necessary because the dataset does not have queries, so the authors need to construct a set of queries based on the product descriptions. On the other hand, the item embedding is also based on its product reviews. By looking at all of its reviews, the model should be able to summarize the item and learn its embedding vector.

The last piece of information is the query. The model also learns word embeddings for queries and then constructs a query vector by averaging all word embeddings, applying a non-linear projection to them, or using an RNN to get a final embedding vector. It means that this model learns word embeddings for all words observed in the pseudo-queries as well as embeddings for all users and items.

The model still needs to relate the query, the user, and the item, and I think this is its key assumption: a linear combination of the user and query vectors should be similar to the vector of a relevant item. In other words, knowing the user and the query, we should have a good idea of which products/items are relevant. As a result, the objective function maximizes the log-likelihood of the language models of all users, the language models of all items, and the item generation likelihood.
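The linear-combination assumption can be illustrated with a short NumPy sketch. The mixing weight `lam`, the dot-product scoring, and the softmax over all items are assumptions made for this example, and `item_probabilities` is a hypothetical name rather than the paper's code.

```python
import numpy as np

def item_probabilities(user_vec, query_vec, item_vecs, lam=0.5):
    """Score every item against a convex combination of query and user vectors."""
    joint = lam * query_vec + (1.0 - lam) * user_vec   # personalized search intent
    logits = item_vecs @ joint                         # similarity to each item
    logits = logits - logits.max()                     # subtract max for stability
    probs = np.exp(logits)
    return probs / probs.sum()

user = np.array([1.0, 0.0])
query = np.array([0.0, 1.0])
items = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])   # toy item embeddings
probs = item_probabilities(user, query, items)
best_item = int(np.argmax(probs))
```

Setting `lam` close to 1 makes the ranking depend almost entirely on the query, while a value close to 0 makes it depend mostly on the user.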

Neural networks are used throughout the model. To construct a query vector, projecting the average word embedding yields the best accuracy. However, this model requires approximating a lot of language models, which could be challenging to train. Thus, the authors approximate the softmax calculation with the negative sampling technique proposed by [2]. The experimental results seem valid. The model performs poorly on some product searches, but overall it beats the state-of-the-art baselines.

My impression of this model is that it seems to be product specific due to the nature of the given reviews. Some products, like electronics, have a different pattern of reviews than books. This may limit the scalability of the model because we need a separate model for each product category. Furthermore, the model uses review text to personalize for a user, which means that it could end up summarizing the user’s sentiment or word usage. If we mix multiple product categories, the sentiment might get mixed up and it will be difficult to tell a user’s preference. This is probably the main reason why the authors separate the product categories and train separate models.

References:

[1] Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, W. Bruce Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product Search. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.