Why KL Divergence is always non-negative?

Here is the proof showing that the KL divergence of two distributions p and q is always non-negative.

D_{\text{KL}}(p||q) = \int_{x \in \text{supp}(p)} p(x) \log \frac{p(x)}{q(x)} dx \\  = - \int_{x \in \text{supp}(p)} p(x) \log \frac{q(x)}{p(x)} dx \\  \ge \log\int_{x \in \text{supp}(p)} p(x) \frac{q(x)}{p(x)} dx \\  \ge \log\int_{x \in \text{supp}(p)} q(x) dx \\  \ge \log 1 \\  \ge 0

The key step is to apply the Jensen’s inequality so that the logarithm will be placed outside of the integration.



VAE: Adding More Stochastic Layers gives a tighter lower-bound.

Does adding more stochastic layers to the recognition model (encoder function) give a tighter lower-bound? Daniel Jiwoong (Bengio’s student)’s AAAI paper, “Denoising Criterion for Variational Auto-Encoding Framework. (AAAI 2017)”, claims that this is true.

It has been known that multi-modal recognition models can learn a more complex posterior distribution from the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the complexity of the recognition models.

Lemma 0:

This proof shows that the following inequality is true:

E_{f(x)}[\log f(x)] \ge E_{f(x)}[\log g(x)]

By using the KL divergence property, which is defined as:

D_{KL}(f || g) = \int_x f(x) \log \frac{f(x)}{g(x)} dx = E_{f(x)}[ \log \frac{f(x)}{g(x)} ] \ge 0

Since KL divergence is always non-negative (you can also prove that too.), arranging the inequality will result in the following expression:

E_{f(x)}[ \log f(x)] - E_{f(x)}[ \log g(x)] \ge 0


E_{f(x)}[ \log f(x)] \ge E_{f(x)}[ \log g(x)]

This statement says that the cross entropy for f and g is always at least the entropy of $f$. This makes sense because when we set distribution $g$ to be the same as $f$, we will get the lowest cross entropy.

Lemma 1:

A feedforward network with multiple stochastic layers can be defined as a marginal distribution of multiple latent variables:

q(\textbf{z}|\textbf{x}) = \int_{\textbf{h}}q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{x})d\textbf{h} = E_{q(\textbf{h}|\textbf{x})}[q(\textbf{z}|\textbf{h})]


\log p(\textbf{x}) \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

We will prove that left and right inequality are satisfied.

Right inequality

We start with the right inequality by simplifying the expression:

E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]- E_{q(\textbf{z}|\textbf{x})}[q(\textbf{z}|\textbf{h})] \ge E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]-E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})]

From Lemma 0: if we replace f(x) with q(\textbf{z}|\textbf{x}) and g(x) with q(\textbf{z}|\textbf{h}), we will end up with the following inequality:

E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})] \ge E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{h})]

This shows that the right inequality is satisfied.

Left inequality

We expand the encoder function q(\textbf{z}|\textbf{x}) as:

E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(z|\textbf{x})}] = \int_\textbf{z} q(\textbf{z}|\textbf{x})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} \\ = \int_\textbf{z} \int_h q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{z})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} d\textbf{h} \\ = E_{q(\textbf{z}|\textbf{h})}E_{q(\textbf{h}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

According to the Jensen’s inequality:

E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \le \log E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \\ = \log E_{q(\textbf{h}|\textbf{x})}\int_{\textbf{z}} q(\textbf{z}|\textbf{h})[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}]d\textbf{z} \\ = \log E_{q(\textbf{h}|\textbf{x})}[p(\textbf{x})] = \log p(\textbf{x})

The left inequality is also satisfied.


I went over the proof presented in the paper, “Denoising Criterion for Variational Auto-Encoding Framework”. The simple proof on adding one extra stochastic layer shows that we get a tighter lowerbound. The original paper also generalizes its claim to L stochastic layers. By following the same proof strategy, they show that the lowerbound will be tighter as we add more stochastic layers.


[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.

[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.

[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” arXiv preprint arXiv:1605.08803 (2016).


NNets Methods for NLP – CH8: From Textual Features to Inputs

I highlighted the key concepts from chapter 8 of “Neural Network Methods for NLP” by Yoav Goldberg.

One hot encoding: a sparse feature vector, assign a unique dimension for each possible feature. Each row of W corresponds to a particular feature.

Dense encoding: a dimension is smaller than the number of features. The matrix W is much smaller. It has a better generalization. Can use the pre-trained embedding from a larger text corpus.

Windows-based features: represent a local structure around the focus word.

  • Concat all surrounding words if we care the word position.
  • Sum or average word vectors if we do not care the word position.
  • Use weight sum if we somewhat care the word position.

CBOW – an average of word vectors.

Padding: add a special symbol to the vocabulary e.g. beginning or ending indicators.

Unknown word: a special token represents an unknown word.

Word signature: a fine-grained strategy to deal with unknown words. E.g. any rare word ending with ‘ing’ is replaced with *__ING*; any number is replaced with a *NUM* token.

Word Dropout

  • Replace some infrequent features (words) with an unknown token. But we lose some information.
  • Randomly replace a word with an unknown token. This replacement is based on word frequency. One possible formula is \frac{\alpha}{c(w) + \alpha} where \alpha is dropout aggressiveness.

Word dropout as regularization

Apply word dropout to all words, ignoring the word frequency. Use Bernoulli trial.


Chapter 8, “Neural Network Methods for NLP”, 2nd edition, Yoav Goldberg.


Restricted Boltzmann Machine (RBM)

The classical work in a generative model that has a connection to an artificial neural network is the Boltzmann Machine [1] and Restricted Boltzmann Machine (RBM) [2]. The deep version of RBM [3] shows many successes in pretraining the parameters of the neural networks. There are many good tutorials on RBM and source code that we can find online, my goal for this post is to point out some important aspects of RBM as my review material before posting about more complicated models such as NADE [4].

My favorite tutorial on RBM is by Fischer and Christian [5]. It covers many key ideas for modeling and training RBM. My post is based on this tutorial paper.

What I understand about the RBM is that we are modeling a joint probability of visible and hidden units, p(\bf v, \bf h). RBM removes all connections between hidden units, and result in an effective learning because all hidden units are conditional independence given visible units.


A graphical model of RBM. Blue nodes are visible units and Yellow nodes are hidden units. We want to infer hidden units from visible units.

One way to define the joint probability is to use Gibbs distribution and it defines p(\bf v, \bf h) = \frac{1}{Z} e^{-E(\bf v, \bf h)}, where Z is a normalization constant or a partition function, Z = \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. It important to realize that computing the normalization constant is intractable because we have to enumerate all possible \bf v and \bf h.

The gradient of log-likelihood is \log L(\bf \theta | \bf v) =  \log p(\bf v; \bf \theta) = \log \frac{1}{Z} \sum_{\bf h} e^{-E(\bf v, \bf h)} =\log \sum_{\bf h} e^{-E(\bf v, \bf h)}  -\log \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. Once we take a derivative w.r.t. to the model parameter, \bf \theta, the gradient is:

\frac{\partial \log L(\bf \theta | \bf v)}{\partial \bf \theta} = - \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta} + \sum_{\bf v, \bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta} This expression shows that the gradient is the difference between an expected energy under the model distribution and under the posterial distribution. Directly evaluating these summation is intractable.

RBM defines an energy function as E(\bf v, \bf h) = -\bf h^T W \bf v - \bf b^T \bf v - \bf c^T \bf h We can derive the gradient w.r.t. to w_{ij}. First, we derive \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial w_{ij}} = \sigma( \sum_{j=1}^m w_{ij}v_j + c_i)v_j = p(H_i = 1|\bf v)v_j . Then, the gradient is become:

 \frac{\partial \log L(\bf \theta | \bf v)}{\partial w_{ij}} = p(H_i=1|\bf v)v_j - \sum_{\bf v} p(\bf v)p(H_i = 1|\bf v)v_j

We can also show that p(H_i=1| \bf v) is a sigmoid function. A similar derivation can be applied for p(V_i =1|\bf h) = \sigma(\sum_{i=1}^n w_{ij}h_i + b_j) . Due to these facts, RBM can be interpreted as a stochastic neural network where the probability of a hidden node being one is dictated by p(H_i =1 | \bf v).

Training RBM is tricky because we can’t directly evaluate the expected term. One common approximation is to use  Gibbs sampling to sample the most probable hidden and variable units from p(\bf v, \bf h) Then, we can approximate the second term. However, running Gibbs sampling requires many iterations in order to reach a stationary point, Hinton proposed a contrastive divergence where running Gibbs sampling only for k steps.

RBM is the classical stochastic neural network where it models a joint distribution between visible and hidden units. Training RBM is tricky and requires an approximation of the second log-likelihood term. The modern models such as NADE and VAE turn stochastic layers in a neural network to a deterministic system and thus training can be done through a standard backpropagation. This might be one of the reasons why NADE and VAE are more popular than RBM.


[1] Hinton, Geoffrey E., and Terrence J. Sejnowski. “Optimal perceptual inference.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE New York, 1983.

[2] Hinton, Geoffrey E. “Training products of experts by minimizing contrastive divergence.” Neural Computation 14.8 (2002): 1771-1800.

[3] Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Deep Boltzmann Machines.” AISTATS. Vol. 1. 2009.

[4] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.

[5] Fischer, Asja, and Christian Igel. “An introduction to restricted Boltzmann machines.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (2012): 14-36.

Review some Information Theory

Here are my note on some concepts from self-study on Information Theory:

The Shannon information content:

h(x) = \log_2 \frac{1}{p(x)}

This is a measurement of the information content of the event x = a. For example, when we flip a coin, an event of landing a head is 0.5, then h(x=’head’) = 1. It means that we only need one bit for this event. E.g. If we landed a tail, we know that it won’t be a head. One bit is sufficient.

Entropy: an average of Shannon information:

H(X) = -\sum_x p(x) \log_2 p(x)  = E[- \log_2 p(x) ]

We consider all events and average out all the Shannon information. Entropy can be thought as an uncertainty. The higher entropy, the more uncertainty of the event. This implies that the Entropy will be maximum when p(x) is a uniform distribution since we can’t make any prediction.

The Relative entropy (KL Divergence)

KL(P||Q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = E[\log_2 \frac{p(x)}{q(x)}]

KL(P||Q) is always non-negative (Check Gibbs’ inequality). This term will show up often in many machine learning models. It measures how much distribution q(x) diverges from p(x).

The Conditional entropy of X given Y:

This is an average uncertainty that remains about x when y is known. If I know y, then how much I still don’t know about x.

H(X|Y=b_k) = - \sum p(x|y=b_k) \log_2 p(x|y=b_k)

If we average over y:

$latext H(X|Y) = \sum_y p(y) [-\sum_x p(x|y) \log_2 p(x|y) ] = -sum_{x,y} p(x,y) \log p(x|y) $

Chain rule: 

h(x,y) = h(x) + h(y|x)

Information content of x and y is the information content of x plus information coent of y given x.

Entropy Chain rule:

H(X,Y) = H(x) + H(Y|X)

An uncertainty about X and Y is the uncertainty of X plus uncertainty of Y given X.

Mutual Information:

This term appears often in information retrieval and some machine learning models. This term measures the average reduction in uncertainty about x that results from learning the value of y. E.g. How much I will learn about x once I know about y ? What is an average amount of information that y conveys about x?

I(X; Y) = H(X) - H(X|Y)

I(X; Y) = I(Y; X) \ge 0

When we add more data, we always decrease uncertainty.

H(X,Y) \le H(X) + H(Y)

H(Y|X) = H(X,Y) - H(X) \le H(Y) + H(X) - H(X) \le H(Y)


[1] MacKay, David JC. Information theory, inference and learning algorithms. Cambridge university press, 2003.

Memory Network

Today our colleague presented Memory Network [1] at our research seminar. I want to summarize my understanding on this model and compared it with other deep sequential models such as LSTM and RNNs.

In Question-Answering problem, we are given a question to a learning model, and the model will output an answer based on information and facts that it was given during the training. LSTM and Sequence-To-Sequence models are popular choices of models due to they can handle a variable length input which is common in text and these models avoid vanish gradient problem.

The Memory Network extends these earlier models by adding supporting facts as additional inputs. Intuitively, this should help the model to answer the question better because of the extra information. The model will first locate a few good facts (which is 2 sentences in this paper ) and use that as additional inputs.

Locating relevant facts is time-consuming. Thus, the authors proposed a few approximation to speeding up. The first approximation is to only search for a sentence that sharing the same word with the question. This is the fastest method but not accurate. The second approximation is to cluster all word hash and only search for sentences that have at least one word sharing the same cluster as the given question. This method is much more accurate but it is also slower.

There are a few more extensions such as adding time relation among sentences in order to answer more tricky questions.

In summing up, the memory network does not utilize all facts but instead picking a few relevant facts. The process of fact picking is inefficient. Instead, if we assign a weight to each fact, we might be able to utilize more information to answer the given question.  The following up works have exploited this idea and demonstrated better results in QA problem.


[1] Weston, Jason, Sumit Chopra, and Antoine Bordes. “Memory networks.” arXiv preprint arXiv:1410.3916 (2014).

Entropy in Variational Inference

In Variational Inference, the evidence lowerbound (ELBO) is defined as:

\log p(x) = \log \int_z p(x, z) dz = \log E_{q(z)}[\frac{p(x,z)}{q(z)}]
>= E_{q(z)}[\log p(x, z) ]- E_{q(z)}[ \log q(z) ]

At this point, I usually pay attention to the first term, the expected log-likelihood and treat the second term, entropy, as extra baggage. It makes sense that the entropy term is less relevant especially in mean-field approximation because when the complicated approximate distribution is now just a factored of much simpler approximate distributions, q(z) = \prod_i q(z_i) , the entropy term is just a sum of entropy for each hidden variables.

The entropy term shows up again when I worked on the Variational Autoencoder formulation. The entropy term is merge with a prior and result in a relative entropy:

\log p(x) >= E_{q(z)}[\log p(x| z) ] + E_{q(z)}[\log p(z)]- E_{q(z)}[ \log q(z) ]
= E_{q(z)}[\log p(x| z) ] - D_{kl}(q(z)||p(z))

When I pay closer attention, the relative entropy represents a gap (slack) between the lowerbound and the log-likelihood. The high relative entropy means that I badly approximate the true posterior.

When I studied information theory more thoroughly, the entropy can be viewed as the uncertainty of the event of interest. The more uncertainty, the higher entropy because we need to add extra bits to encode these unknown. Then, the entropy in the ELBO could imply how useful the variational distribution. If q(z) is a uniform distribution, then it is useless because we cannot make any good prediction. Thus, we want q(z) to be less flat which automatically means more predictable and lower entropy.

I think I need to be able to look from information theory perspective to get a better sense of how things work in Variational Inference. This knowledge will definitely be helpful for my future research.