This paper [1] proposes a method for embedding a large graph structure into a low-dimensional space. In contrast to DeepWalk [2], LINE utilizes both first-order proximity (a direct edge between two vertices) and second-order proximity (two vertices sharing similar neighbors). The main contribution is using second-order proximity to preserve the global structure of the graph. Another benefit of this approach is that it increases the number of training samples, which matters because an information network (graph) can be sparse (a small number of observed edges).
Another key contribution is their optimization procedure. An edge sampling method stabilizes training through stochastic gradient descent. Without this method, the gradients can explode and have high variance, which degrades performance.
The key observation is that first-order proximity alone is not sufficient to preserve the network structure, due to the small number of observed links between vertices. Hence, looking at the neighboring vertices can be helpful. The second-order proximity between vertices u and v is defined as the similarity between the neighborhoods of u and v.
The objective function for first-order proximity encourages the model to find embeddings $\vec{u}_i$ and $\vec{u}_j$ for vertices $v_i$ and $v_j$ such that connected vertices end up with similar embeddings. The model attempts to maximize the following joint probability:

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top} \vec{u}_j)}$$
Then, they want to make sure that this joint probability is close to the empirical probability $\hat{p}_1(i, j) = \frac{w_{ij}}{W}$, where $W = \sum_{(i,j) \in E} w_{ij}$.
The objective function (after dropping constants from the KL divergence) is:

$$O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$$
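The first-order objective is straightforward to compute. Below is a minimal NumPy sketch; the embedding matrix, edge list, and weights are toy values for illustration, not data from the paper:

```python
import numpy as np

def first_order_objective(edges, weights, U):
    """Negative weighted log-likelihood of observed edges under
    p1(v_i, v_j) = sigmoid(u_i . u_j)."""
    total = 0.0
    for (i, j), w in zip(edges, weights):
        p1 = 1.0 / (1.0 + np.exp(-np.dot(U[i], U[j])))
        total -= w * np.log(p1)
    return total

# toy graph: 3 vertices with 2-d embeddings, 2 weighted edges
U = np.array([[0.5, 0.1], [0.4, 0.2], [-0.3, 0.6]])
edges = [(0, 1), (1, 2)]
weights = [1.0, 2.0]
loss = first_order_objective(edges, weights, U)
```

Minimizing this loss pulls the embeddings of connected vertices together, with heavier edges contributing more to the gradient.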
This model assumes that if two vertices share many common neighbors, then they should be similar. The set of shared neighbors is treated as a context. The authors introduce a context embedding: each vertex $v_i$ now has an additional embedding vector $\vec{u}_i'$, used when it acts as a context for other vertices. I think the real motivation is to pull together the embeddings of any vertices that share a similar context; it implicitly forces two similar vertices to have similar embedding vectors.
To measure the similarity between a vertex $v_i$ and its context vertex $v_j$, the conditional distribution of $v_j$ given $v_i$ is defined as:

$$p_2(v_j \mid v_i) = \frac{\exp(\vec{u}_j'^{\top} \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k'^{\top} \vec{u}_i)}$$
Then, they also want this conditional distribution to be similar to the empirical distribution $\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}$, where $w_{ij}$ is the weight of edge $(i, j)$ and $d_i = \sum_{k \in N(i)} w_{ik}$ is the out-degree of vertex $i$.
The objective function is defined as:

$$O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)$$
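The second-order objective can be sketched the same way, now with a separate context-embedding matrix and a softmax over all vertices. Again the embeddings and edges below are toy values, not from the paper:

```python
import numpy as np

def second_order_objective(edges, weights, U, Uc):
    """Negative weighted log-likelihood under
    p2(v_j | v_i) = softmax of u_k'^T u_i over all context vertices k."""
    total = 0.0
    for (i, j), w in zip(edges, weights):
        scores = Uc @ U[i]                        # u_k'^T u_i for every k
        log_p = scores[j] - np.log(np.exp(scores).sum())
        total -= w * log_p
    return total

# toy graph: vertex embeddings U and context embeddings Uc
U = np.array([[0.5, 0.1], [0.4, 0.2], [-0.3, 0.6]])
Uc = np.array([[0.1, 0.0], [0.2, 0.3], [0.0, -0.1]])
edges = [(0, 1), (1, 2)]
weights = [1.0, 2.0]
loss = second_order_objective(edges, weights, U, Uc)
```

The full softmax sums over every vertex, which is exactly why the paper resorts to negative sampling for large graphs; this naive version is only practical for tiny examples like this one.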
The authors simply concatenate the embedding vectors learned by the first-order and second-order models. Jointly training the two objectives is left for future work.
The experimental results show that straightforward optimization with SGD suffers from high gradient variance. Thus, edge sampling is an important strategy for getting the best performance.
The main problem is that each edge has a different weight, and multiplying gradients by these weights causes high variance when the weights have a large spread. To make every sampled edge effectively binary, they sample edges according to their weights: the probability of choosing edge (i, j) is the ratio of its weight $w_{ij}$ to the sum of all edge weights. Since naive weighted sampling is computationally expensive, they use the alias table method [3] to speed up sampling to constant time per draw.
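The alias table (Walker's alias method) takes O(n) time to build and then draws a weighted sample in O(1). A minimal sketch, with toy edge weights for illustration:

```python
import random

def build_alias_table(weights):
    """Walker's alias method: O(n) setup, O(1) per sample."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]   # scaled so the mean is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                          # overflow from l fills slot s
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def sample_edge(prob, alias):
    """Pick a slot uniformly, then keep it or jump to its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# toy example: three edges with weights 1, 2, 7
random.seed(0)
prob, alias = build_alias_table([1.0, 2.0, 7.0])
counts = [0, 0, 0]
for _ in range(100_000):
    counts[sample_edge(prob, alias)] += 1
freq = [c / 100_000 for c in counts]          # roughly [0.1, 0.2, 0.7]
```

Each draw costs one uniform integer and one uniform float regardless of how many edges there are, which is what makes weighted edge sampling cheap enough for large graphs.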
[1] Tang, Jian, et al. "LINE: Large-scale information network embedding." Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.
[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
My post on DeepWalk: DeepWalk: Online Learning of Social Representation (KDD'14)
[3] Li, Aaron Q., et al. "Reducing the sampling complexity of topic models." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.