Does adding more stochastic layers to the recognition model (the encoder) give a tighter lower bound? Daniel Jiwoong Im (a student of Bengio) claims in his AAAI 2017 paper, "Denoising Criterion for Variational Auto-Encoding Framework" [1], that it does.

It has been shown that multi-modal recognition models can learn more complex posterior distributions from the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the expressiveness of the recognition model.
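As a toy illustration of this intuition (my own sketch, not from the paper): marginalizing a simple Gaussian encoder over one extra stochastic layer already yields a multi-modal distribution that no single Gaussian can match. Here the extra layer just picks one of two corruption centers (all numbers are made up):

```python
import math

# Hypothetical setup, for illustration only:
# extra stochastic layer: x_tilde is -1.0 or +1.0 with equal probability,
# encoder: q(z | x_tilde) = Normal(x_tilde, 0.1^2).
# The marginal q(z) = 0.5*N(-1, 0.1^2) + 0.5*N(+1, 0.1^2) is bimodal.

def normal_pdf(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def marginal_q(z):
    return 0.5 * normal_pdf(z, -1.0, 0.1) + 0.5 * normal_pdf(z, 1.0, 0.1)

# Count local maxima of the marginal density on a grid over [-3, 3].
grid = [i / 100.0 for i in range(-300, 301)]
vals = [marginal_q(z) for z in grid]
modes = sum(
    1 for i in range(1, len(vals) - 1) if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]
)
print(modes)  # 2 -> the marginal is bimodal, unlike any single Gaussian
```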

## Lemma 0:

This proof shows that the following inequality is true:

$$-\int f(x)\log g(x)\,dx \;\geq\; -\int f(x)\log f(x)\,dx$$

We use the KL divergence, which is defined as:

$$\mathrm{KL}(f\,\|\,g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx = \int f(x)\log f(x)\,dx - \int f(x)\log g(x)\,dx$$

Since the KL divergence is always non-negative (this can itself be proven, e.g. via Jensen's inequality), rearranging the terms gives the following expression:

$$\int f(x)\log f(x)\,dx - \int f(x)\log g(x)\,dx \;\geq\; 0$$

Hence,

$$-\int f(x)\log g(x)\,dx \;\geq\; -\int f(x)\log f(x)\,dx$$

This statement says that the cross entropy of $f$ and $g$ is always at least the entropy of $f$. This makes sense: when we set the distribution $g$ to be the same as $f$, we get the lowest possible cross entropy.
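A quick numerical sanity check of Lemma 0, using two arbitrary example distributions of my own choosing. In the discrete case the inequality reads $-\sum_i f_i \log g_i \geq -\sum_i f_i \log f_i$:

```python
import math

# Two arbitrary discrete distributions over three outcomes (made-up numbers).
f = [0.2, 0.5, 0.3]
g = [0.4, 0.4, 0.2]

cross_entropy = -sum(fi * math.log(gi) for fi, gi in zip(f, g))
entropy = -sum(fi * math.log(fi) for fi in f)
kl = cross_entropy - entropy  # KL(f || g), by the rearrangement in Lemma 0

print(cross_entropy >= entropy)  # True: cross entropy is at least the entropy
print(kl >= 0.0)                 # True: equivalently, KL(f || g) >= 0

# Equality holds when g = f: the cross entropy of f with itself is its entropy.
self_cross = -sum(fi * math.log(fi) for fi in f)
print(abs(self_cross - entropy) < 1e-12)  # True
```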

## Lemma 1:

A feedforward network with multiple stochastic layers can be defined as a marginal distribution over multiple latent variables. With one extra stochastic layer, the corrupted input $\tilde{x}$ drawn from a corruption distribution $p(\tilde{x}\mid x)$, the recognition model becomes:

$$\tilde{q}(z\mid x) = \int q(z\mid \tilde{x})\, p(\tilde{x}\mid x)\, d\tilde{x}$$

Then,

$$\log p(x) \;\geq\; \mathbb{E}_{\tilde{q}(z\mid x)}\!\left[\log\frac{p(x,z)}{\tilde{q}(z\mid x)}\right] \;\geq\; \mathbb{E}_{p(\tilde{x}\mid x)}\,\mathbb{E}_{q(z\mid \tilde{x})}\!\left[\log\frac{p(x,z)}{q(z\mid \tilde{x})}\right]$$

We will prove that the left and the right inequality are both satisfied.

### Right inequality

We start with the right inequality by simplifying the expression. Since $\tilde{q}(z\mid x)$ is the marginal of $q(z\mid \tilde{x})\,p(\tilde{x}\mid x)$, the $\log p(x,z)$ terms cancel and only the entropy terms remain:

$$\mathbb{E}_{\tilde{q}(z\mid x)}\!\left[\log\frac{p(x,z)}{\tilde{q}(z\mid x)}\right] - \mathbb{E}_{p(\tilde{x}\mid x)}\,\mathbb{E}_{q(z\mid \tilde{x})}\!\left[\log\frac{p(x,z)}{q(z\mid \tilde{x})}\right] = \mathbb{E}_{p(\tilde{x}\mid x)}\,\mathbb{E}_{q(z\mid \tilde{x})}\!\left[\log\frac{q(z\mid \tilde{x})}{\tilde{q}(z\mid x)}\right]$$

From Lemma 0: if we replace $f$ with $q(z\mid \tilde{x})$ and $g$ with $\tilde{q}(z\mid x)$, we end up with the following inequality, which holds for every $\tilde{x}$:

$$\mathbb{E}_{q(z\mid \tilde{x})}\!\left[\log q(z\mid \tilde{x})\right] \;\geq\; \mathbb{E}_{q(z\mid \tilde{x})}\!\left[\log \tilde{q}(z\mid x)\right]$$

Taking the expectation over $p(\tilde{x}\mid x)$, the difference above is therefore non-negative.

This shows that the right inequality is satisfied.
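The Lemma 0 step can be checked in closed form on a 1-D Gaussian example of my own construction (the parameter values are made up). With corruption $p(\tilde{x}\mid x)=\mathcal{N}(x,\sigma^2)$ and encoder $q(z\mid \tilde{x})=\mathcal{N}(a\tilde{x}, s^2)$, the marginal encoder is $\tilde{q}(z\mid x)=\mathcal{N}(ax,\, s^2+a^2\sigma^2)$, and Gaussian entropy and cross entropy have known closed forms:

```python
import math

# A sketch with a 1-D Gaussian encoder (all numbers are made up):
#   corruption  p(x_tilde | x)  = Normal(x, sigma2)
#   encoder     q(z | x_tilde)  = Normal(a * x_tilde, s2)
#   marginal    q_tilde(z | x)  = Normal(a * x, s2 + a^2 * sigma2)
a, s2, sigma2, x = 0.5, 0.5, 0.25, 1.3

def gauss_entropy(v):
    # Differential entropy of Normal(mu, v).
    return 0.5 * math.log(2 * math.pi * math.e * v)

def gauss_cross_entropy(m1, v1, m2, v2):
    # -E_{Normal(m1, v1)}[log Normal(z; m2, v2)], in closed form.
    return 0.5 * math.log(2 * math.pi * v2) + (v1 + (m1 - m2) ** 2) / (2 * v2)

v_marg = s2 + a * a * sigma2
for x_tilde in [-2.0, 0.0, 1.3, 4.0]:
    # Lemma 0 with f = q(z | x_tilde) and g = q_tilde(z | x):
    ce = gauss_cross_entropy(a * x_tilde, s2, a * x, v_marg)
    h = gauss_entropy(s2)
    assert ce >= h, (x_tilde, ce, h)
print("right inequality holds for every sampled x_tilde")
```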

### Left inequality

We expand $\log p(x)$ using the encoder function:

$$\log p(x) = \log \int p(x,z)\,dz = \log \int \tilde{q}(z\mid x)\,\frac{p(x,z)}{\tilde{q}(z\mid x)}\,dz$$

According to Jensen's inequality (the logarithm is concave):

$$\log \mathbb{E}_{\tilde{q}(z\mid x)}\!\left[\frac{p(x,z)}{\tilde{q}(z\mid x)}\right] \;\geq\; \mathbb{E}_{\tilde{q}(z\mid x)}\!\left[\log\frac{p(x,z)}{\tilde{q}(z\mid x)}\right]$$

The left inequality is also satisfied.
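To see both inequalities at once, here is a closed-form check on a toy linear-Gaussian model of my own construction (not from the paper): $p(z)=\mathcal{N}(0,1)$ and $p(x\mid z)=\mathcal{N}(z,1)$, so $\log p(x)$ is available exactly, and both lower bounds can be evaluated analytically for the Gaussian encoder sketched above:

```python
import math

# Toy linear-Gaussian model (all parameter values are made up):
#   p(z) = Normal(0, 1), p(x|z) = Normal(z, 1)  =>  p(x) = Normal(0, 2) exactly.
#   corruption p(x_tilde|x) = Normal(x, sigma2), encoder q(z|x_tilde) = Normal(a*x_tilde, s2),
#   marginal encoder q_tilde(z|x) = Normal(a*x, s2 + a^2*sigma2).
a, s2, sigma2 = 0.5, 0.5, 0.25

def log_px(x):
    # Exact log evidence: log Normal(x; 0, 2).
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, v):
    # E_{Normal(m,v)}[log p(x,z)] plus the entropy of Normal(m,v), in closed form.
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + v)
    entropy = 0.5 * math.log(2 * math.pi * math.e * v)
    return e_log_lik + e_log_prior + entropy

def denoising_bound(x):
    # E_{p(x_tilde|x)}[ elbo with q(z|x_tilde) ], expectation taken analytically:
    # E[(x - a*x_tilde)^2] = (x - a*x)^2 + a^2*sigma2, E[(a*x_tilde)^2] = a^2*x^2 + a^2*sigma2.
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - a * x) ** 2 + a * a * sigma2 + s2)
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (a * a * x * x + a * a * sigma2 + s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return e_log_lik + e_log_prior + entropy

for x in [-2.0, 0.0, 0.7, 3.0]:
    lp = log_px(x)
    mid = elbo(x, a * x, s2 + a * a * sigma2)  # bound with the marginal encoder
    low = denoising_bound(x)                   # denoising lower bound
    assert lp >= mid >= low, (x, lp, mid, low)
print("log p(x) >= ELBO(q_tilde) >= denoising bound, as claimed")
```

In this toy model the gap between the two bounds comes out to $\frac{1}{2}\log\frac{s^2+a^2\sigma^2}{s^2}$, which is exactly the expected KL term from the right-inequality proof.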

## Closing

I went over the proof presented in the paper "Denoising Criterion for Variational Auto-Encoding Framework" [1]. The simple proof for adding one extra stochastic layer shows that we get a tighter lower bound. The original paper also generalizes the claim to $L$ stochastic layers: following the same proof strategy, the authors show that the lower bound becomes tighter as more stochastic layers are added.

**Reference:**

[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.

[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.

[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. "Density estimation using Real NVP." arXiv preprint arXiv:1605.08803 (2016).