Why KL Divergence is always non-negative?

Here is the proof showing that the KL divergence of two distributions p and q is always non-negative.

D_{\text{KL}}(p||q) = \int_{x \in \text{supp}(p)} p(x) \log \frac{p(x)}{q(x)} dx \\  = - \int_{x \in \text{supp}(p)} p(x) \log \frac{q(x)}{p(x)} dx \\  \ge \log\int_{x \in \text{supp}(p)} p(x) \frac{q(x)}{p(x)} dx \\  \ge \log\int_{x \in \text{supp}(p)} q(x) dx \\  \ge \log 1 \\  \ge 0

The key step is to apply the Jensen’s inequality so that the logarithm will be placed outside of the integration.



VAE: Adding More Stochastic Layers gives a tighter lower-bound.

Does adding more stochastic layers to the recognition model (encoder function) give a tighter lower-bound? Daniel Jiwoong (Bengio’s student)’s AAAI paper, “Denoising Criterion for Variational Auto-Encoding Framework. (AAAI 2017)”, claims that this is true.

It has been known that multi-modal recognition models can learn a more complex posterior distribution from the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the complexity of the recognition models.

Lemma 0:

This proof shows that the following inequality is true:

E_{f(x)}[\log f(x)] \ge E_{f(x)}[\log g(x)]

By using the KL divergence property, which is defined as:

D_{KL}(f || g) = \int_x f(x) \log \frac{f(x)}{g(x)} dx = E_{f(x)}[ \log \frac{f(x)}{g(x)} ] \ge 0

Since KL divergence is always non-negative (you can also prove that too.), arranging the inequality will result in the following expression:

E_{f(x)}[ \log f(x)] - E_{f(x)}[ \log g(x)] \ge 0


E_{f(x)}[ \log f(x)] \ge E_{f(x)}[ \log g(x)]

This statement says that the cross entropy for f and g is always at least the entropy of $f$. This makes sense because when we set distribution $g$ to be the same as $f$, we will get the lowest cross entropy.

Lemma 1:

A feedforward network with multiple stochastic layers can be defined as a marginal distribution of multiple latent variables:

q(\textbf{z}|\textbf{x}) = \int_{\textbf{h}}q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{x})d\textbf{h} = E_{q(\textbf{h}|\textbf{x})}[q(\textbf{z}|\textbf{h})]


\log p(\textbf{x}) \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

We will prove that left and right inequality are satisfied.

Right inequality

We start with the right inequality by simplifying the expression:

E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]- E_{q(\textbf{z}|\textbf{x})}[q(\textbf{z}|\textbf{h})] \ge E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]-E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})]

From Lemma 0: if we replace f(x) with q(\textbf{z}|\textbf{x}) and g(x) with q(\textbf{z}|\textbf{h}), we will end up with the following inequality:

E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})] \ge E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{h})]

This shows that the right inequality is satisfied.

Left inequality

We expand the encoder function q(\textbf{z}|\textbf{x}) as:

E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(z|\textbf{x})}] = \int_\textbf{z} q(\textbf{z}|\textbf{x})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} \\ = \int_\textbf{z} \int_h q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{z})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} d\textbf{h} \\ = E_{q(\textbf{z}|\textbf{h})}E_{q(\textbf{h}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

According to the Jensen’s inequality:

E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \le \log E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \\ = \log E_{q(\textbf{h}|\textbf{x})}\int_{\textbf{z}} q(\textbf{z}|\textbf{h})[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}]d\textbf{z} \\ = \log E_{q(\textbf{h}|\textbf{x})}[p(\textbf{x})] = \log p(\textbf{x})

The left inequality is also satisfied.


I went over the proof presented in the paper, “Denoising Criterion for Variational Auto-Encoding Framework”. The simple proof on adding one extra stochastic layer shows that we get a tighter lowerbound. The original paper also generalizes its claim to L stochastic layers. By following the same proof strategy, they show that the lowerbound will be tighter as we add more stochastic layers.


[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.

[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.

[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” arXiv preprint arXiv:1605.08803 (2016).


NNets Methods for NLP – CH8: From Textual Features to Inputs

I highlighted the key concepts from chapter 8 of “Neural Network Methods for NLP” by Yoav Goldberg.

One hot encoding: a sparse feature vector, assign a unique dimension for each possible feature. Each row of W corresponds to a particular feature.

Dense encoding: a dimension is smaller than the number of features. The matrix W is much smaller. It has a better generalization. Can use the pre-trained embedding from a larger text corpus.

Windows-based features: represent a local structure around the focus word.

  • Concat all surrounding words if we care the word position.
  • Sum or average word vectors if we do not care the word position.
  • Use weight sum if we somewhat care the word position.

CBOW – an average of word vectors.

Padding: add a special symbol to the vocabulary e.g. beginning or ending indicators.

Unknown word: a special token represents an unknown word.

Word signature: a fine-grained strategy to deal with unknown words. E.g. any rare word ending with ‘ing’ is replaced with *__ING*; any number is replaced with a *NUM* token.

Word Dropout

  • Replace some infrequent features (words) with an unknown token. But we lose some information.
  • Randomly replace a word with an unknown token. This replacement is based on word frequency. One possible formula is \frac{\alpha}{c(w) + \alpha} where \alpha is dropout aggressiveness.

Word dropout as regularization

Apply word dropout to all words, ignoring the word frequency. Use Bernoulli trial.


Chapter 8, “Neural Network Methods for NLP”, 2nd edition, Yoav Goldberg.


Node2Vec: Scalable feature learning for networks (KDD’16)


This paper proposes a generalization of DeepWalk [2] and Line [3].

Key Ideas

The key observation is that the role of the vertex is also important. For example, given two vertices that are far apart from each other but share similar kind of vertices. Then, they both have similar roles (e.g. hubs or bridges). Breadth-First-Search traversal can capture this graph structure. On the other hands, the community is described as reachability/closeness of the two nodes in the graph. For instance, in the social network, my friend’s friend’s friend has a higher chance to belong to the same community as me.


Similar Works

DeepWalk uses a random walk to create a set of vertices that represents a context around the vertex of interest. The path generated by the random walk will either lead to the very faraway node or nodes around the seed. The context that is captured by this process can be unpredictable and depends on the graph structure.

Line focuses on neighbor vertices, which is the same as Breadth-First-Search traversal (BFS). In this case, it captures the local community in the graph.


It comes down to what is a proper way to define the walk so that we can capture both community structure and role-based structure. Node2Vec defines a general method of graph traversal that is controlled by 2 parameters, p and q.

The key difference between BFS and DFS samplings is that the BFS is better at exploring the local neighborhoods while DFS is good at exploring larger parts of the network. BFS is viewed as micro-view of the graph whereas DFS characterizes the macro-view of the graph. The authors believe that the mixture of these two classic sampling will improve the representation of the graph embedding.

Search bias \alpha


The unnormalized transitional probability from node v to x is:

\alpha_{p,q}(t, x) = \frac{1}{p} if d_{tx} = 0

\alpha_{p,q}(t, x) = 1 if d_{tx} = 1

\alpha_{p,q}(t, x) = \frac{1}{q} if d_{tx} = 2

Where t is the previously visited node, v is the current node, and x is the next node. The distance d_{tx} determines the type of visiting nodes. When the distance between the previous node t and the next node x is zero, it means that we return to the node t.

However, when the distance is 1, it means that we want to visit the node that is directly connected to current node v and the previous node t. It means that node x is shared node between v and t. Hence, this walk will capture the structure of the network (local view).

Lastly, when the distance is 2, we want to hop further away from node t. This is similar to DFS where we want to go deeper into the network graph.

Parameter p and q will control the characteristic of bias walking. The high value of p means that we don’t want to go back often. The high value of q means that we do not want to make too many hops. Hence, p and q control the balance between BFS and DFS sampling.


This paper generalizes the random walk by adding parameters to control the walk characteristic. I think this is a neat idea because some information networks may need a specific walk than a general random walk. Thus, this model allows us to define the context based on how much we want to explore the network versus how much we want to exploit the local structure.


[1] Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016.


[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[3] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.

LINE: Large-scale Information Network Embedding (WWW’15)


This paper [1] proposes the method of embedding a large graph structure into low-dimensional space. In contrast to DeepWalk [2], LINE utilizes both first-order (direct edge) and second order proximity (share similar neighbors). The contribution is to use 2nd order proximity to preserve the global structure of the graph. Another benefit of this approach is to increase the number of training samples because information network (graph) can be sparse ( small number of edges ).


The contribution is to use 2nd order proximity to preserve the global structure of the graph. Another benefit of this approach is to increase the number of training samples because information network (graph) can be sparse ( small number of edges ).


Another key contribution is their optimization procedure. An edge sampling method stabilizes the training through stochastic gradient descent. Without this method, the gradient can be exploded and has a high variance which degrades the performance.

Second-order Proximity

The key observation is that the first-order proximity alone is not sufficient to preserve the network structure due to the small number of observed links/connections between vertices. Hence, observing the neighbor vertices can be helpful. The second proximity between vertex u and v is defined as the similarity between the neighbor vertices of u and v.

First-order Model

The objective function to preserve the first-order proximity is to encourage the model to find embedding for vertex v_i and v_j such that their embedding u_i and u_j are similar. The model attempts to maximize the following joint probability:

p_1(v_i, v_j) = \frac{1}{1 + \exp(-\textbf{u}_i^T\textbf{u}_j)}$

Then, they want to make sure that the joint probability p_1 is close to the empirical probability \hat p_1(i, j) = \frac{w_{ij}}{W} where W = \sum_{(i,j) \in E} w_{ij}.

The objective function is:

O_1 = - \sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)

Second-order Model

This model assumes that if the two vertices share many common vertices, then they should be similar. These set of sharing vertices is treated as a context. The authors introduce context embedding, basically, each vertex will now additional embedding vector which is a context embedding, u'_i. I think the real motivation is to force any vertex embedding that sharing similar context to become closer. It implicitly forces similar embedding vectors between two similar vertices.

To measure the similarity between vertex v_i and its context vertex v_j, we define the conditional distribution over the given vertex v_i as:

p_2(v_j|v_i) = \frac{\exp({\textbf{u}'_j}^T\textbf{u}_i)}{\sum_{k=1}^{|V|} \exp({\textbf{u}'_k}^T\textbf{u}_i)}

Then, they also want to the conditional distribution to be similar to the empirical distribution \hat p_2(v_j|v_i) = \frac{w_{ij}}{d_i} where w_{ij} is a weight of edge (i, j) and d_i is the out-degree of vertex i.

The objective function is defined as:

O_2 = - \sum_{(i, j) \in E} w_{ij} \log p_2(v_j|v_i)

Combining Model

The authors simply concatenate the embedding vectors learned from the first model and second model. A jointly train the objective function is left for the future work.

Edge Sampling

The experimental results show that a straightforward optimization using SGD suffers from the high variance problem. Thus, edge sampling is an important strategy to get the best performance.

The main problem is that each edge has different weight. In order to force a binary weight, they sample the edges according to the weights. The sampling strategy is actually simple. Since each edge has different weight, the probability of choosing the edge (i, j) is a ratio of weight (i, j) over the sum of all weight. Since this sampling strategy is computationally expensive, they use the alias table method [3] to speed up the sampling.


[1] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.

[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

My post on DeepWalk: DeepWalk: Online Learning of Social Representation (KDD’14)

[3] Li, Aaron Q., et al. “Reducing the sampling complexity of topic models.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

DeepWalk: Online Learning of Social Representation (KDD’14)

DeepWalk is a novel approach for learning a latent representation of vertices in a network. The problem is summarized as we are given a graph containing edges and vertices, we want to embed each vertex into an embedding space.


How does DeepWalk work?

The idea behind this model is that vertices that are near each other should have similar latent vectors. So it really comes down to how to define a set of similar vertices. Interestingly, there is a connection between graph and natural language. The authors show that the distribution of words in natural language, as well as distribution of vertices appearing in short random walks, follow a power-law.


This means that a random walk starting at any random vertex is the same as modeling a symbol frequency. Hence, we can use a language model to estimate the likelihood of vertex appearing the given random walk.

Basically, DeepWalk model wants to learn a mapping function \Phi: v \in V \rightarrow \mathcal{R}^{|V|\times d} that will maximize the following likelihood function:

P(v_i | \Phi(v_1), \Phi(v_2), \cdots, \Phi(v_{i-1})

But the above likelihood function becomes more expensive to compute as the length of the walk grows. Hence, they use similar relaxations found in word2vec including word orderless, fixed window size, and skip-gram.

Now, the likelihood has changed to:

P(v_{i-w}, \cdots, v_{i-1}, v_{i+1}, \cdots, v_{i+w}|\Phi(v_i)) (1)

This model is the same the Skip-gram model in word2vec. The author uses a hierarchical softmax to model equation 1.

To train the model, the authors will go through each vertex in the graph, perform a random walk of length t, then optimize the objective function.

This is an interesting paper and shows the connection between graph structure and word embedding via the local context.


[1] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.


Deep Autoregressive Network (ICML’14)

This is one of the early paper on generative modeling. This work was on arXiv since Oct 2013 before the reparameterization trick has been popularized [2, 3]. It is interesting to look back and see the challenge of training the model with stochastic layers.

Model Architecture

Deep Autoregressive Network (DARN) is a flexible deep generative model with the following architecture features: First, its stochastic layer is stackable. This improves representational power. Second, the deterministic layers can be inserted between the stochastic layers to add complexity to the model.  Third, the generative model such as NADE and EoNADE can be used instead of the simple linear autoregressive. This also improves the representational power.



The main difference from VAE [2] is that the hidden units are binary vectors (which is similar to the restricted Boltzmann machine). VAE requires a continuous vector as hidden units unless we approximate the discrete units with Gumbel-Softmax.

DARN does not assume any form of distribution on its prior p(h) or conditional distribution p(x|h) and q(h|x). The vanilla VAE assumes a standard Gaussian distribution with the diagonal covariance matrix. This could be either good or bad thing for DARN.


Since DARN is an autoregressive model, it needs to sample one value at a time, from top hidden layer all the way down to the observed layer.

Minimum Description Length

This is my favorite section of this paper. There is a strong connection between the information theory and variational lowerbound. In EM algorithm, we use Jensen’s inequality to derive the smooth function that acts as a lowerbound of the log likelihood. Some scholars refer this lowerbound as an Evidence Lowerbound (ELBO).

This lowerbound can be derived from information theory perspective. From the Shannon’s theory, the description length is:

L(x|h) = -\log_2 p(x|h)

If h is a compressed version of x, then we need to transport h along with the residual in order to reconstruct the original message x.

The main idea is simple. The less predictable event requires more bits to encode. The shorter bits is better because we will transport fewer bits over the wire. Hence, we want to minimize the description length of the following message x:

L(x) = \sum_h q(h|x)(L(h) + L(x|h))

The q(h|x) is an encode probability of h. The description length of the representation h is defined as:

L(h) = -log_2 p(h) + \log_2 q(h|x)

Finally, the entire description length or Helmholtz variational free energy is:

L(x) = -sum_h q(h|x)(\log_2 p(x,h) - \log_2 q(h|x)) (1)

This is formula is exactly the same as the ELBO when h is a discrete value.


The variational free energy formula (1) is intractable because it requires summation over all h. DARN employs a sampling approach to learn the parameter.

The expectation term is approximated by sampling h \sim q(H|x), then now we can compute the gradient of (1). However, this approach has a high variance. The authors use a few tricks to keep variance low. (Check out the apprendix).


DARN is one of the early paper that use the stochastic layers as part of its model. Optimization through these layers posed a few challenges such as high variances from the Monte Carlo approximation.


[1] Gregor, Karol, et al. “Deep autoregressive networks.” arXiv preprint arXiv:1310.8499 (2013).

[2] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[3] Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint arXiv:1401.4082 (2014).