Review some Information Theory

Here are my note on some concepts from self-study on Information Theory:

The Shannon information content:

h(x) = \log_2 \frac{1}{p(x)}

This is a measurement of the information content of the event x = a. For example, when we flip a coin, an event of landing a head is 0.5, then h(x=’head’) = 1. It means that we only need one bit for this event. E.g. If we landed a tail, we know that it won’t be a head. One bit is sufficient.

Entropy: an average of Shannon information:

H(X) = -\sum_x p(x) \log_2 p(x)  = E[- \log_2 p(x) ]

We consider all events and average out all the Shannon information. Entropy can be thought as an uncertainty. The higher entropy, the more uncertainty of the event. This implies that the Entropy will be maximum when p(x) is a uniform distribution since we can’t make any prediction.

The Relative entropy (KL Divergence)

KL(P||Q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = E[\log_2 \frac{p(x)}{q(x)}]

KL(P||Q) is always non-negative (Check Gibbs’ inequality). This term will show up often in many machine learning models. It measures how much distribution q(x) diverges from p(x).

The Conditional entropy of X given Y:

This is an average uncertainty that remains about x when y is known. If I know y, then how much I still don’t know about x.

H(X|Y=b_k) = - \sum p(x|y=b_k) \log_2 p(x|y=b_k)

If we average over y:

$latext H(X|Y) = \sum_y p(y) [-\sum_x p(x|y) \log_2 p(x|y) ] = -sum_{x,y} p(x,y) \log p(x|y) $

Chain rule: 

h(x,y) = h(x) + h(y|x)

Information content of x and y is the information content of x plus information coent of y given x.

Entropy Chain rule:

H(X,Y) = H(x) + H(Y|X)

An uncertainty about X and Y is the uncertainty of X plus uncertainty of Y given X.

Mutual Information:

This term appears often in information retrieval and some machine learning models. This term measures the average reduction in uncertainty about x that results from learning the value of y. E.g. How much I will learn about x once I know about y ? What is an average amount of information that y conveys about x?

I(X; Y) = H(X) - H(X|Y)

I(X; Y) = I(Y; X) \ge 0

When we add more data, we always decrease uncertainty.

H(X,Y) \le H(X) + H(Y)

H(Y|X) = H(X,Y) - H(X) \le H(Y) + H(X) - H(X) \le H(Y)


[1] MacKay, David JC. Information theory, inference and learning algorithms. Cambridge university press, 2003.