Entropy in Variational Inference

In variational inference, the evidence lower bound (ELBO) is obtained by applying Jensen's inequality to the log marginal likelihood:

\log p(x) = \log \int p(x, z) dz = \log E_{q(z)}[\frac{p(x, z)}{q(z)}]
\geq E_{q(z)}[\log p(x, z)] - E_{q(z)}[\log q(z)]
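
To see the bound at work, here is a minimal numerical sketch on a toy discrete model. The joint probabilities in `p_xz` and the distribution `q_guess` are made-up numbers, and the helper names are hypothetical; the only point is that the ELBO stays below \log p(x) and touches it when q(z) equals the true posterior.

```python
import math

# Hypothetical joint probabilities p(x, z) for one fixed observation x and a
# three-state latent z. The numbers are made up for illustration.
p_xz = [0.10, 0.25, 0.05]

def log_evidence(p_xz):
    """log p(x) = log sum_z p(x, z)."""
    return math.log(sum(p_xz))

def elbo(q, p_xz):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete q."""
    expected_log_joint = sum(qi * math.log(pi) for qi, pi in zip(q, p_xz) if qi > 0)
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return expected_log_joint + entropy

q_guess = [0.5, 0.3, 0.2]                  # an arbitrary variational distribution
q_exact = [p / sum(p_xz) for p in p_xz]    # the true posterior p(z | x)

print(log_evidence(p_xz))    # log p(x)
print(elbo(q_guess, p_xz))   # below log p(x)
print(elbo(q_exact, p_xz))   # equals log p(x) up to rounding
```

With an arbitrary q(z) the bound is loose, and it becomes tight only when q(z) is the posterior.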

At this point, I usually pay attention to the first term, the expected log joint (which contains the log-likelihood), and treat the second term, the entropy of q(z), as extra baggage. It makes sense that the entropy term seems less relevant, especially in the mean-field approximation: once the complicated approximate distribution is factored into a product of much simpler distributions, q(z) = \prod_i q(z_i), the entropy term is just the sum of the entropies of the individual hidden variables.
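
The additivity of entropy under a factorized q(z) can be checked directly. The sketch below assumes three small discrete factors with made-up probabilities (`factors` is a hypothetical name), builds the full joint by enumeration, and compares its entropy with the sum of the per-factor entropies.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical mean-field factors q(z_1), q(z_2), q(z_3).
factors = [[0.7, 0.3], [0.2, 0.5, 0.3], [0.9, 0.1]]

# Full joint q(z) = prod_i q(z_i), one entry per configuration of (z_1, z_2, z_3).
joint = [math.prod(ps) for ps in product(*factors)]

joint_entropy = entropy(joint)
sum_of_factor_entropies = sum(entropy(f) for f in factors)

print(joint_entropy, sum_of_factor_entropies)  # the two agree up to rounding
```

The joint here has 12 entries, but its entropy never needs the joint table: the per-factor entropies are enough, which is why the term is cheap in the mean-field setting.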

The entropy term showed up again when I worked on the Variational Autoencoder formulation. There, the entropy term is merged with the prior term, and the result is a relative entropy (KL divergence):

\log p(x) \geq E_{q(z)}[\log p(x|z)] + E_{q(z)}[\log p(z)] - E_{q(z)}[\log q(z)]
= E_{q(z)}[\log p(x|z)] - D_{kl}(q(z)||p(z))
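
For a concrete feel of the relative entropy term, here is a small sketch under an assumption the text above does not state: q(z) is a diagonal Gaussian and the prior p(z) is a standard normal, which is a common choice for Variational Autoencoders. In that case the divergence has a closed form. The function name is hypothetical.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """D_kl( N(mu, diag(sigma^2)) || N(0, I) ) in nats.

    Assumes q(z) is a diagonal Gaussian and p(z) is a standard normal.
    """
    return 0.5 * sum(
        s ** 2 + m ** 2 - 1.0 - math.log(s ** 2) for m, s in zip(mu, sigma)
    )

# q equals the prior, so the divergence is zero.
print(kl_diag_gaussian_to_std_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0

# Shifting one mean by 1 costs 0.5 nats.
print(kl_diag_gaussian_to_std_normal([1.0, 0.0], [1.0, 1.0]))  # 0.5
```

The term grows as q(z) moves away from the prior, either by shifting its mean or by shrinking or inflating its variance.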

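Looking more closely, a relative entropy also measures the gap (slack) between the lower bound and the log-likelihood, but it is a different one: the gap is D_{kl}(q(z)||p(z|x)), the divergence from q(z) to the true posterior, not the divergence to the prior p(z) that appears above. A high value of this divergence means that I approximate the true posterior badly.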
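
The gap can be made explicit with a short derivation. Since p(x, z) = p(z|x) p(x), we have \log p(x) = \log p(x, z) - \log p(z|x) for every z. Taking the expectation under q(z), then adding and subtracting E_{q(z)}[\log q(z)], gives:

```latex
\log p(x) = E_{q(z)}[\log p(x, z)] - E_{q(z)}[\log p(z|x)]
= E_{q(z)}[\log p(x, z)] - E_{q(z)}[\log q(z)] + E_{q(z)}[\log q(z)] - E_{q(z)}[\log p(z|x)]
= ELBO + D_{kl}(q(z)||p(z|x))
```

Because the relative entropy is never negative, the ELBO is a lower bound, and the bound is tight exactly when q(z) = p(z|x).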

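When I studied information theory more thoroughly, I learned that entropy can be viewed as the uncertainty about the event of interest. The more uncertainty, the higher the entropy, because we need extra bits to encode the unknown outcome. The entropy in the ELBO therefore says something about how useful the variational distribution is. If q(z) is a uniform distribution, it is useless because we cannot make any good prediction from it, so we want q(z) to be less flat, which means more predictable and lower entropy. Note, however, that the entropy enters the ELBO with a positive sign: maximizing the ELBO rewards a spread-out q(z), while the expected log joint pulls q(z) toward the values of z that explain the data. The optimal q(z) balances the two rather than simply minimizing entropy.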
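
The "bits" reading of entropy is easy to check on small examples. The sketch below compares a uniform distribution over eight outcomes with a sharply peaked one; both distributions are made up for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8            # maximally uncertain over 8 outcomes
peaked = [0.93] + [0.01] * 7     # almost certain about the first outcome

print(entropy_bits(uniform))     # 3.0 bits: a full 3-bit code is needed
print(entropy_bits(peaked))      # well under 1 bit on average
```

The uniform case needs the full three bits per outcome, while the peaked case needs far fewer on average because one outcome is nearly certain.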

I think I need to look at variational inference from the information-theoretic perspective to get a better sense of how things work. This knowledge should be helpful for my future research.