Here is the proof showing that the KL divergence of two distributions $f$ and $g$ is always non-negative.
The key step is to apply Jensen's inequality so that the logarithm is moved outside of the integral.
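Written out with $f$ and $g$ denoting the two densities (the same notation used for the cross-entropy statement later on), a short sketch of the argument is:

$$
-D_{\mathrm{KL}}(f \,\|\, g) = \int f(x)\log\frac{g(x)}{f(x)}\,dx \;\le\; \log\int f(x)\,\frac{g(x)}{f(x)}\,dx = \log\int g(x)\,dx = \log 1 = 0,
$$

and therefore $D_{\mathrm{KL}}(f \,\|\, g) \ge 0$. Jensen's inequality for the concave logarithm, $\mathbb{E}[\log Y] \le \log \mathbb{E}[Y]$, is exactly the step that places the logarithm outside of the integral.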
Does adding more stochastic layers to the recognition model (the encoder) give a tighter lower bound? Daniel Jiwoong Im (Bengio’s student) claims that it does in his AAAI paper, “Denoising Criterion for Variational Auto-Encoding Framework” (AAAI 2017) [1].
It is known that multi-modal recognition models can learn a more complex posterior distribution from the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the complexity of the recognition model.
This proof (referred to as Lemma 0 below) shows that the following inequality is true:

$$
-\int f(x)\log g(x)\,dx \;\ge\; -\int f(x)\log f(x)\,dx
$$

By using the definition of the KL divergence,

$$
D_{\mathrm{KL}}(f \,\|\, g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx = \int f(x)\log f(x)\,dx - \int f(x)\log g(x)\,dx,
$$

and since the KL divergence is always non-negative (as proved above), rearranging the inequality results in the following expression:

$$
\int f(x)\log f(x)\,dx - \int f(x)\log g(x)\,dx \;\ge\; 0.
$$

Hence,

$$
-\int f(x)\log g(x)\,dx \;\ge\; -\int f(x)\log f(x)\,dx.
$$
This statement says that the cross entropy of $f$ and $g$ is always at least the entropy of $f$. This makes sense because when we set the distribution $g$ to be the same as $f$, we get the lowest possible cross entropy.
A feedforward network with multiple stochastic layers can be defined as a marginal distribution over multiple latent variables: the deeper recognition model integrates out its intermediate stochastic layers. The claim is then a chain of inequalities that places the deeper model’s bound between the single-layer objective and the true log-likelihood, as sketched below.
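For the sketch I assume one extra stochastic layer $z_1$ in the encoder, so that the deeper recognition model is the marginal $q(z \mid x) = \int q(z \mid z_1)\, q(z_1 \mid x)\, dz_1$ (the paper’s own notation may differ). The claim is then:

$$
\log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[\log\frac{p(x, z)}{q(z \mid x)}\right] \;\ge\; \mathbb{E}_{q(z_1 \mid x)\, q(z \mid z_1)}\!\left[\log\frac{p(x, z)}{q(z \mid z_1)}\right]
$$

The left inequality says that the deeper recognition model still gives a valid lower bound on $\log p(x)$; the right inequality says that this bound is at least as tight as the objective that uses only the top layer $q(z \mid z_1)$ as the variational density.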
We will prove that both the left and the right inequality are satisfied.
We start with the right inequality by simplifying the expression. From Lemma 0, replacing $f$ and $g$ with the appropriate encoder distributions, we end up with the required inequality (a sketch of this step is given after this argument).
This shows that the right inequality is satisfied.
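For concreteness, here is a sketch of that step under the notation assumed above. Because $q(z \mid x)$ is the marginal of $q(z_1 \mid x)\, q(z \mid z_1)$, the two bounds differ only in their entropy terms:

$$
\mathbb{E}_{q(z \mid x)}\!\left[\log\frac{p(x, z)}{q(z \mid x)}\right] - \mathbb{E}_{q(z_1 \mid x)\, q(z \mid z_1)}\!\left[\log\frac{p(x, z)}{q(z \mid z_1)}\right] = \mathbb{E}_{q(z_1 \mid x)\, q(z \mid z_1)}\!\left[\log\frac{q(z \mid z_1)}{q(z \mid x)}\right] \;\ge\; 0,
$$

where the last step is Lemma 0 applied for each fixed $z_1$, with $f = q(z \mid z_1)$ and $g = q(z \mid x)$.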
We expand the encoder function as a marginal over the intermediate stochastic layer. According to Jensen’s inequality, moving the logarithm inside the expectation can only decrease the value, which yields the desired lower bound (a sketch is given after this argument).
The left inequality is also satisfied.
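Here is the corresponding sketch, in the same assumed notation. Expanding the encoder as $q(z \mid x) = \int q(z \mid z_1)\, q(z_1 \mid x)\, dz_1$ and applying Jensen’s inequality to the concave logarithm gives:

$$
\log p(x) = \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x, z)}{q(z \mid x)}\right] \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[\log\frac{p(x, z)}{q(z \mid x)}\right]
$$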
I went over the proof presented in the paper “Denoising Criterion for Variational Auto-Encoding Framework” [1]. The simple proof for adding one extra stochastic layer shows that we get a tighter lower bound. The original paper also generalizes the claim to L stochastic layers: by following the same proof strategy, the authors show that the lower bound becomes tighter as more stochastic layers are added.
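In the same assumed notation, a recognition model with intermediate stochastic layers $z_1, \dots, z_L$ would be the marginal

$$
q(z \mid x) = \int \cdots \int q(z \mid z_1)\, q(z_1 \mid z_2) \cdots q(z_L \mid x)\, dz_1 \cdots dz_L,
$$

and the analogous claim is that the bound built from this marginal is at least as tight as the nested objective that uses only $q(z \mid z_1)$ as the variational density; the exact statement and notation are in the paper [1].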
References:
[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.
[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.
[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” arXiv preprint arXiv:1605.08803 (2016).