# Node2Vec: Scalable feature learning for networks (KDD’16)

## Intro

This paper [1] proposes a generalization of DeepWalk [2] and LINE [3].

## Key Ideas

The key observation is that the role of a vertex is also important. For example, two vertices that are far apart from each other but are surrounded by similar kinds of vertices have similar roles (e.g. hubs or bridges). Breadth-First-Search traversal can capture this kind of graph structure. On the other hand, a community is described by the reachability/closeness of two nodes in the graph. For instance, in a social network, my friend’s friend’s friend has a higher chance of belonging to the same community as me.

## Similar Works

DeepWalk uses a random walk to create a set of vertices that represents the context around the vertex of interest. The path generated by the random walk may either lead to faraway nodes or stay around the seed. The context captured by this process can therefore be unpredictable and depends on the graph structure.
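As a rough illustration of the uniform random walk described above, here is a minimal sketch. The toy graph, the function name `random_walk`, and the walk length are all hypothetical choices made for this example and are not taken from the paper.

```python
import random

def random_walk(graph, start, length, rng):
    """Uniform random walk: at each step, move to a neighbor chosen uniformly."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Hypothetical undirected toy graph stored as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

rng = random.Random(0)
# One walk per vertex; each walk serves as the context of its seed vertex.
walks = [random_walk(graph, seed, 5, rng) for seed in graph]
```

Nothing in this walk steers it toward nearby or faraway vertices, which is why the resulting context depends entirely on the graph structure.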

LINE focuses on neighboring vertices, which is similar to a Breadth-First-Search (BFS) traversal. In this case, it captures the local community in the graph.

## Node2Vec

The question comes down to how to define the walk so that we can capture both community structure and role-based structure. Node2Vec defines a general method of graph traversal that is controlled by two parameters, p and q.

The key difference between BFS and DFS sampling is that BFS is better at exploring local neighborhoods while DFS is good at exploring larger parts of the network. BFS gives a micro-view of the graph whereas DFS characterizes the macro-view of the graph. The authors believe that a mixture of these two classic sampling strategies will improve the learned graph embedding.

## Search bias $\alpha$

The unnormalized transition probability from the current node v to the next node x is the edge weight $w_{vx}$ scaled by the search bias $\alpha_{p,q}(t, x)$:

$\alpha_{p,q}(t, x) = \frac{1}{p}$ if $d_{tx} = 0$

$\alpha_{p,q}(t, x) = 1$ if $d_{tx} = 1$

$\alpha_{p,q}(t, x) = \frac{1}{q}$ if $d_{tx} = 2$

Here t is the previously visited node, v is the current node, and x is the next node. The shortest-path distance $d_{tx}$ between t and x determines the type of the next node. When the distance between the previous node t and the next node x is zero, it means that we return to node t.

When the distance is 1, we visit a node that is directly connected to both the current node v and the previous node t. In other words, node x is a common neighbor of v and t. Hence, this step captures the local structure of the network (local view).

Lastly, when the distance is 2, we hop further away from node t. This is similar to DFS, where we want to go deeper into the graph.

Parameters p and q control the characteristics of the biased walk. A high value of p means that we do not want to go back often. A high value of q means that we do not want to make too many hops away. Hence, p and q control the balance between BFS and DFS sampling.
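To make the three cases concrete, here is a minimal sketch of the search bias on an unweighted graph. The function names, the toy graph, and the values of p and q are assumptions made for illustration, and every edge weight is taken to be 1.

```python
def search_bias(graph, t, x, p, q):
    """alpha_{p,q}(t, x) for a candidate next node x, given the previous node t."""
    if x == t:          # d_tx = 0: return to the previous node
        return 1.0 / p
    if x in graph[t]:   # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q      # d_tx = 2: x moves away from t

def transition_probs(graph, t, v, p, q):
    """Normalized probabilities of stepping from v to each of its neighbors."""
    weights = {x: search_bias(graph, t, x, p, q) for x in graph[v]}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical toy graph: the walk arrived at "v" from "t".
graph = {
    "t": {"v", "a"},
    "v": {"t", "a", "b"},
    "a": {"t", "v"},
    "b": {"v"},
}

# High p discourages returning to t; low q encourages moving outward to b.
probs = transition_probs(graph, t="t", v="v", p=4.0, q=0.5)
```

With these settings the outward node `b` is the most likely next step and the return to `t` the least likely, which is the DFS-like behavior described above.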

## Closing

This paper generalizes the random walk by adding parameters that control the characteristics of the walk. I think this is a neat idea because some information networks may need a more specific walk than a general random walk. Thus, this model allows us to define the context based on how much we want to explore the network versus how much we want to exploit the local structure.

## References:

[1] Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016.

https://arxiv.org/pdf/1607.00653.pdf

[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[3] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.

# LINE: Large-scale Information Network Embedding (WWW’15)

## Introduction

This paper [1] proposes a method for embedding a large graph into a low-dimensional space. In contrast to DeepWalk [2], LINE utilizes both first-order proximity (direct edges) and second-order proximity (shared neighbors).

## Contribution

The contribution is to use second-order proximity to preserve the global structure of the graph. Another benefit of this approach is that it increases the number of training samples, because an information network (graph) can be sparse (have a small number of edges).

Another key contribution is their optimization procedure. An edge sampling method stabilizes training with stochastic gradient descent. Without this method, the gradients can explode and have a high variance, which degrades the performance.

## Second-order Proximity

The key observation is that first-order proximity alone is not sufficient to preserve the network structure due to the small number of observed links/connections between vertices. Hence, observing the neighboring vertices can be helpful. The second-order proximity between vertices u and v is defined as the similarity between the neighborhoods of u and v.
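A small sketch may help here. It represents each vertex by its vector of edge weights to every other vertex and compares two such vectors. Using cosine similarity as the similarity measure is an assumption of this example, as are the toy graph and the function names.

```python
import math

def neighbor_vector(graph, u, vertices):
    """Edge weights from u to every vertex (0.0 where there is no edge)."""
    return [graph[u].get(v, 0.0) for v in vertices]

def second_order_proximity(graph, u, v):
    """Cosine similarity between the neighbor-weight vectors of u and v."""
    vertices = sorted(graph)
    pu = neighbor_vector(graph, u, vertices)
    pv = neighbor_vector(graph, v, vertices)
    dot = sum(a * b for a, b in zip(pu, pv))
    norm = math.sqrt(sum(a * a for a in pu)) * math.sqrt(sum(b * b for b in pv))
    return dot / norm if norm else 0.0

# Hypothetical weighted toy graph: "u" and "v" are not linked but share neighbors.
graph = {
    "u": {"a": 1.0, "b": 1.0},
    "v": {"a": 1.0, "b": 1.0},
    "w": {"c": 1.0},
    "a": {"u": 1.0, "v": 1.0},
    "b": {"u": 1.0, "v": 1.0},
    "c": {"w": 1.0},
}

shared = second_order_proximity(graph, "u", "v")    # same neighborhood
disjoint = second_order_proximity(graph, "u", "w")  # no common neighbor
```

Here `u` and `v` have no direct edge, so their first-order proximity is zero, yet their second-order proximity is high because they share the same neighbors.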

## First-order Model

The first-order objective encourages the model to find embeddings $u_i$ and $u_j$ for vertices $v_i$ and $v_j$ that are similar whenever the two vertices are connected. The model attempts to maximize the following joint probability:

$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\textbf{u}_i^T\textbf{u}_j)}$

Then, they want to make sure that the joint probability $p_1$ is close to the empirical probability $\hat p_1(i, j) = \frac{w_{ij}}{W}$ where $W = \sum_{(i,j) \in E} w_{ij}$. The objective function is:

$O_1 = - \sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$

## Second-order Model

This model assumes that if two vertices share many common neighbors, then they should be similar. The set of shared vertices is treated as a context. The authors introduce a context embedding: each vertex now has an additional embedding vector, $u'_i$, which is used when the vertex acts as a context. I think the real motivation is to force the embeddings of vertices that share similar contexts to become closer. It implicitly forces similar embedding vectors between two similar vertices.

To measure the similarity between vertex $v_i$ and its context vertex $v_j$, we define the conditional distribution given vertex $v_i$ as:

$p_2(v_j|v_i) = \frac{\exp({\textbf{u}'_j}^T\textbf{u}_i)}{\sum_{k=1}^{|V|} \exp({\textbf{u}'_k}^T\textbf{u}_i)}$

Then, they also want the conditional distribution to be similar to the empirical distribution $\hat p_2(v_j|v_i) = \frac{w_{ij}}{d_i}$ where $w_{ij}$ is the weight of edge (i, j) and $d_i$ is the out-degree of vertex i. The objective function is defined as:

$O_2 = - \sum_{(i, j) \in E} w_{ij} \log p_2(v_j|v_i)$

## Combining Model

The authors simply concatenate the embedding vectors learned from the first model and the second model. Jointly training the two objective functions is left for future work.

## Edge Sampling

The experimental results show that a straightforward optimization using SGD suffers from a high-variance problem. Thus, edge sampling is an important strategy to get the best performance. The main problem is that each edge has a different weight. In order to treat the edges as binary, they sample the edges according to their weights. The sampling strategy is actually simple.
Since each edge has a different weight, the probability of choosing edge (i, j) is the ratio of its weight to the sum of all edge weights. Since this sampling strategy is computationally expensive, they use the alias table method [3] to speed up the sampling.

## References:

[1] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.

[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

My post on DeepWalk: DeepWalk: Online Learning of Social Representation (KDD’14)

[3] Li, Aaron Q., et al. “Reducing the sampling complexity of topic models.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

# DeepWalk: Online Learning of Social Representation (KDD’14)

DeepWalk is a novel approach for learning a latent representation of the vertices in a network. The problem can be summarized as follows: given a graph containing edges and vertices, we want to embed each vertex into an embedding space.

## How does DeepWalk work?

The idea behind this model is that vertices that are near each other should have similar latent vectors. So it really comes down to how to define a set of similar vertices. Interestingly, there is a connection between graphs and natural language. The authors show that the distribution of words in natural language, as well as the distribution of vertices appearing in short random walks, follows a power law. This suggests that the vertices visited by random walks can be modeled in the same way as symbol frequencies in text. Hence, we can use a language model to estimate the likelihood of a vertex appearing in a given random walk.
Basically, the DeepWalk model wants to learn a mapping function $\Phi: v \in V \rightarrow \mathbb{R}^{|V|\times d}$ that will maximize the following likelihood function:

$P(v_i | \Phi(v_1), \Phi(v_2), \cdots, \Phi(v_{i-1}))$

But the above likelihood function becomes more expensive to compute as the length of the walk grows. Hence, they use relaxations similar to those found in word2vec, including ignoring word order, a fixed window size, and skip-gram. Now, the likelihood changes to:

$P(v_{i-w}, \cdots, v_{i-1}, v_{i+1}, \cdots, v_{i+w}|\Phi(v_i)) \qquad (1)$

This model is the same as the Skip-gram model in word2vec. The authors use a hierarchical softmax to model equation 1. To train the model, the authors go through each vertex in the graph, perform a random walk of length t, then optimize the objective function.

This is an interesting paper that shows the connection between graph structure and word embedding via the local context.

## References:

[1] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

https://arxiv.org/pdf/1403.6652.pdf

# Deep Autoregressive Network (ICML’14)

This is one of the early papers on generative modeling. This work has been on arXiv since October 2013, before the reparameterization trick was popularized [2, 3]. It is interesting to look back and see the challenges of training a model with stochastic layers.

## Model Architecture

Deep Autoregressive Network (DARN) is a flexible deep generative model with the following architectural features: First, its stochastic layers are stackable. This improves representational power. Second, deterministic layers can be inserted between the stochastic layers to add complexity to the model. Third, generative models such as NADE and EoNADE can be used instead of the simple linear autoregressive model. This also improves the representational power.
The main difference from VAE [2] is that the hidden units are binary vectors (which is similar to the restricted Boltzmann machine). VAE requires continuous vectors as hidden units unless we approximate the discrete units with Gumbel-Softmax. DARN does not assume any form of distribution on its prior $p(h)$ or its conditional distributions $p(x|h)$ and $q(h|x)$. The vanilla VAE assumes a Gaussian distribution with a diagonal covariance matrix. This could be either a good or a bad thing for DARN.

## Sampling

Since DARN is an autoregressive model, it needs to sample one value at a time, from the top hidden layer all the way down to the observed layer.

## Minimum Description Length

This is my favorite section of this paper. There is a strong connection between information theory and the variational lower bound. In the EM algorithm, we use Jensen’s inequality to derive a smooth function that acts as a lower bound of the log likelihood. Some scholars refer to this lower bound as the Evidence Lower Bound (ELBO).

This lower bound can be derived from an information-theoretic perspective. From Shannon’s theory, the description length is:

$L(x|h) = -\log_2 p(x|h)$

If $h$ is a compressed version of $x$, then we need to transport $h$ along with the residual in order to reconstruct the original message $x$.

The main idea is simple. A less predictable event requires more bits to encode. Fewer bits are better because we will transport fewer bits over the wire. Hence, we want to minimize the description length of the message $x$:

$L(x) = \sum_h q(h|x)(L(h) + L(x|h))$

Here $q(h|x)$ is the encoding probability of $h$. The description length of the representation $h$ is defined as:

$L(h) = -\log_2 p(h) + \log_2 q(h|x)$

Finally, the entire description length, or Helmholtz variational free energy, is:

$L(x) = -\sum_h q(h|x)(\log_2 p(x,h) - \log_2 q(h|x)) \qquad (1)$

This formula is exactly the negative of the ELBO (measured in bits) when $h$ is discrete.
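The description-length formulas above can be checked numerically on a tiny example. The sketch below uses one binary latent $h$ and a single fixed observation $x$. All probability values and function names are made up for illustration and do not come from the paper.

```python
import math

# Hypothetical toy model: one binary latent h and one fixed observation x.
p_h = {0: 0.5, 1: 0.5}          # prior p(h)
p_x_given_h = {0: 0.9, 1: 0.2}  # likelihood p(x|h) of the observed x
q_h_given_x = {0: 0.8, 1: 0.2}  # encoder q(h|x)

def description_length(p_h, p_x_given_h, q):
    """L(x) = sum_h q(h|x) * (L(h) + L(x|h)), in bits."""
    total = 0.0
    for h, q_h in q.items():
        length_h = -math.log2(p_h[h]) + math.log2(q_h)  # L(h)
        length_x_given_h = -math.log2(p_x_given_h[h])   # L(x|h)
        total += q_h * (length_h + length_x_given_h)
    return total

def free_energy(p_h, p_x_given_h, q):
    """Equation (1): -sum_h q(h|x) * (log2 p(x,h) - log2 q(h|x))."""
    return -sum(
        q_h * (math.log2(p_h[h] * p_x_given_h[h]) - math.log2(q_h))
        for h, q_h in q.items()
    )

length = description_length(p_h, p_x_given_h, q_h_given_x)
energy = free_energy(p_h, p_x_given_h, q_h_given_x)

# The ideal code length -log2 p(x) is reached when q is the true posterior.
p_x = sum(p_h[h] * p_x_given_h[h] for h in p_h)
ideal = -math.log2(p_x)
posterior = {h: p_h[h] * p_x_given_h[h] / p_x for h in p_h}
best = description_length(p_h, p_x_given_h, posterior)
```

The two functions agree for any valid $q$, and the description length never drops below $-\log_2 p(x)$, which mirrors the ELBO being a lower bound on the log likelihood.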
## Learning

The variational free energy formula (1) is intractable because it requires a summation over all $h$. DARN employs a sampling approach to learn the parameters. The expectation term is approximated by sampling $h \sim q(H|x)$, after which we can compute the gradient of (1). However, this approach has a high variance. The authors use a few tricks to keep the variance low (check out the appendix).

## Closing

DARN is one of the early papers that use stochastic layers as part of the model. Optimization through these layers posed a few challenges, such as the high variance of the Monte Carlo approximation.

References:

[1] Gregor, Karol, et al. “Deep autoregressive networks.” arXiv preprint arXiv:1310.8499 (2013).

[2] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[3] Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint arXiv:1401.4082 (2014).

# Embedding-based Query Language Models (ICTIR’16)

This paper proposes a more accurate query language model for the query expansion task based on word embeddings. The key contribution is their observation that the cosine similarity between a query word and its most similar words is not much higher than its similarity to the 1000th most similar word. They use a sigmoid function to amplify the differences between similarity scores. The rest of the paper proposes two query language models.

### The first model

The first model assumes conditional independence between the query terms.
The model is:

$p(w|\theta_q) \propto p(\theta_q|w)p(w) = p(w)\prod_{i=1}^kp(q_i|w)$

It defines:

$p(q_i|w) = \frac{\delta(q_i,w)}{\sum_{w'\in V} \delta(w',w)}$

and

$p(w) = \sum_{w' \in V} p(w, w') \propto \sum_{w' \in V} \delta(w, w')$

### The second model

$p(w|\theta_q) = \sum_{w'\in V} p(w|w',\theta_q)p(w'|\theta_q)$

This model assumes that the similarity between two terms is independent of the query:

$p(w|w',\theta_q) = p(w|w') = \frac{\delta(w,w')}{\sum_{w''\in V} \delta(w'', w')}$

and the second term uses MLE to estimate the probability:

$p(w'|\theta_q) = \frac{C(w'|Q)}{|Q|}$
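The second model can be sketched in a few lines. In this sketch $\delta$ is a sigmoid applied to the cosine similarity. The sigmoid parameters `a` and `c`, the two-dimensional toy embeddings, and the function names are all hypothetical values chosen for this example.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def delta(u, v, a=10.0, c=0.8):
    """Sigmoid-amplified cosine similarity; a and c are assumed values."""
    return 1.0 / (1.0 + math.exp(-a * (cosine(u, v) - c)))

def query_language_model(query, embeddings):
    """p(w|theta_q) = sum_{w'} p(w|w') p(w'|theta_q), with an MLE for p(w'|theta_q)."""
    vocab = sorted(embeddings)
    scores = dict.fromkeys(vocab, 0.0)
    # p(w'|theta_q) is zero outside the query, so only query terms contribute.
    for w_prime, count in Counter(query).items():
        p_w_prime = count / len(query)  # C(w'|Q) / |Q|
        norm = sum(delta(embeddings[w2], embeddings[w_prime]) for w2 in vocab)
        for w in vocab:
            p_w_given_w_prime = delta(embeddings[w], embeddings[w_prime]) / norm
            scores[w] += p_w_given_w_prime * p_w_prime
    return scores

# Hypothetical 2-D embeddings; every query term must appear in this table.
embeddings = {
    "car": [1.0, 0.1],
    "auto": [0.9, 0.2],
    "banana": [0.0, 1.0],
    "fruit": [0.1, 0.9],
}
model = query_language_model(["car"], embeddings)
```

Because each $p(\cdot|w')$ is normalized over the vocabulary, the resulting scores form a probability distribution, and terms close to the query term in the embedding space receive most of the mass.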

The paper also discusses an embedding-based relevance model. An existing model by the same authors is extended with the proposed embedding-based query language model.

References:

[1] Zamani, Hamed, and W. Bruce Croft. “Embedding-based query language models.” Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval.

https://ciir-publications.cs.umass.edu/getpdf.php?id=1225

# Pseudo-Relevance Feedback based on Matrix Factorization (CIKM’16)

This work uses pseudo-relevance feedback collections to recommend top words and their weights to the input query.

It starts by creating a document-word matrix where each row is a retrieved document and each column holds a term weight. The first row contains the input query’s term weights. The remaining documents come either from relevance feedback or from the top-K retrieved documents (pseudo-relevance feedback).

Then, we perform a non-negative matrix factorization (NMF) on the matrix. The first row of the reconstructed matrix is a re-estimated query, and its top m terms are used as query expansion terms.

However, this model needs to re-compute the NMF for every query, which can be an expensive operation. Thus, the paper recommends creating a small document-word matrix (e.g. using only the top 10 retrieved documents).

What makes this model stand out is that it not only considers the words that discriminate the top documents from the collection, it also considers the relevance of those words to the original query.

I wonder whether calculating the NMF on such a small matrix will really work. The empirical results demonstrate that this idea works, but its performance is not impressive. Still, I think using NMF for a query expansion task is a neat idea.
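The procedure can be sketched with a small pure-Python NMF. The multiplicative update rule used here is a standard way to fit NMF and is an assumption of this example, since the summary above does not say which solver the paper uses. The matrix values, the rank, and the function names are also hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize v (n x m) into non-negative w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # Multiplicative update for h, then for w (both stay non-negative).
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h

# Hypothetical term weights: row 0 is the query, the rest are feedback documents.
terms = ["neural", "network", "graph", "embedding", "banana"]
matrix = [
    [1.0, 1.0, 0.0, 0.0, 0.0],  # query
    [2.0, 1.0, 1.0, 2.0, 0.0],
    [1.0, 2.0, 2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
]

w, h = nmf(matrix, rank=2)
reconstructed = matmul(w, h)
# The first reconstructed row is the re-estimated query; keep its top m terms.
query_row = reconstructed[0]
top_m = 3
ranked = sorted(range(len(terms)), key=lambda j: -query_row[j])
expansion_terms = [terms[j] for j in ranked[:top_m]]
```

Terms that never occur in the feedback documents (here `banana`) keep a reconstructed weight of zero, so they cannot be selected as expansion terms.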

References:

https://ciir-publications.cs.umass.edu/getpdf.php?id=1224

# Generating images from captions with attention (ICLR’16)

## Overview

This work [1] extends the original DRAW paper [2] to generate images given captions. We can treat this model as a conditional DRAW. That is, we model the conditional probability $P(\text{image}|\text{caption})$. The additional textual data controls where to read and write in the image.

## AlignDRAW

Generating images from text descriptions is a structured prediction task. That is, given a sequence of words, we want to generate an image. Although AlignDRAW borrows the same approach as DRAW by combining progressive refinement with attention, incorporating the text sequence is its contribution.

The latent variable in the DRAW model is sampled from a spherical Gaussian, $z_t \sim \mathcal{N}(Z_t|\mu_t, \sigma_t)$, where the mean and variance are functions of the current hidden state of the encoder, e.g. $\mu_t = W(h_t^{\text{enc}})$. However, AlignDRAW adds a dependency between latent variables: $z_t \sim \mathcal{N}(\mu(h_{t-1}^{\text{gen}}), \sigma(h_{t-1}^{\text{gen}}))$.

During image generation, DRAW iteratively samples a latent variable $z_t$ from a prior $\mathcal{N}(0, I)$, but AlignDRAW draws $z_t$ from $P(Z_t|Z_{<t})$. It means that there is a dependency between the latent vectors in the AlignDRAW model.

## Align Operator

The input caption is fed to a bi-directional RNN. The output from each time-step needs to be aligned with the current drawing patch. The attention weights are then learned from the caption representation up to k words and the current hidden state of the decoder $h_{t-1}^{\text{gen}}$. Finally, we compute the weighted average of all hidden states of the language model to obtain the caption context, $s_t$. This context, together with a latent vector $z_t$, is fed to the LSTM decoder.
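The weighted-average step can be illustrated with a short sketch. To keep it small, the alignment score here is a plain dot product between each language-model state and the decoder state. That scoring function is a simplifying assumption of this example and stands in for the learned alignment function of the model. The toy vectors and function names are hypothetical as well.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def caption_context(lang_states, h_gen_prev):
    """Attention weights over caption states and their weighted average s_t."""
    # Assumed score: dot product between each caption state and the decoder state.
    scores = [sum(a * b for a, b in zip(h, h_gen_prev)) for h in lang_states]
    weights = softmax(scores)
    dim = len(lang_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, lang_states)) for d in range(dim)]
    return weights, context

# Hypothetical hidden states for a three-word caption and a decoder state.
lang_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_gen_prev = [2.0, 0.0]

weights, s_t = caption_context(lang_states, h_gen_prev)
```

The caption word whose state best matches the decoder state receives the largest weight, so it dominates the context $s_t$ for the current drawing step.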

## Objective Function

This model maximizes the expectation of the variational lowerbound. There are 2 terms: the data likelihood and KL loss.

## Closing Thoughts

AlignDRAW uses a bi-directional LSTM with attention to align each word context with the patches in the image. Some images generated from captions are interesting, such as ‘A herd of elephants walking across a dry grass field’. The model generalizes beyond the training data and is able to generate novel images.

References:

[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).