Restricted Boltzmann Machine (RBM)

The classical work in a generative model that has a connection to an artificial neural network is the Boltzmann Machine [1] and Restricted Boltzmann Machine (RBM) [2]. The deep version of RBM [3] shows many successes in pretraining the parameters of the neural networks. There are many good tutorials on RBM and source code that we can find online, my goal for this post is to point out some important aspects of RBM as my review material before posting about more complicated models such as NADE [4].

My favorite tutorial on RBM is by Fischer and Christian [5]. It covers many key ideas for modeling and training RBM. My post is based on this tutorial paper.

What I understand about the RBM is that we are modeling a joint probability of visible and hidden units, p(\bf v, \bf h). RBM removes all connections between hidden units, and result in an effective learning because all hidden units are conditional independence given visible units.


A graphical model of RBM. Blue nodes are visible units and Yellow nodes are hidden units. We want to infer hidden units from visible units.

One way to define the joint probability is to use Gibbs distribution and it defines p(\bf v, \bf h) = \frac{1}{Z} e^{-E(\bf v, \bf h)}, where Z is a normalization constant or a partition function, Z = \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. It important to realize that computing the normalization constant is intractable because we have to enumerate all possible \bf v and \bf h.

The gradient of log-likelihood is \log L(\bf \theta | \bf v) =  \log p(\bf v; \bf \theta) = \log \frac{1}{Z} \sum_{\bf h} e^{-E(\bf v, \bf h)} =\log \sum_{\bf h} e^{-E(\bf v, \bf h)}  -\log \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. Once we take a derivative w.r.t. to the model parameter, \bf \theta, the gradient is:

\frac{\partial \log L(\bf \theta | \bf v)}{\partial \bf \theta} = - \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta} + \sum_{\bf v, \bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta} This expression shows that the gradient is the difference between an expected energy under the model distribution and under the posterial distribution. Directly evaluating these summation is intractable.

RBM defines an energy function as E(\bf v, \bf h) = -\bf h^T W \bf v - \bf b^T \bf v - \bf c^T \bf h We can derive the gradient w.r.t. to w_{ij}. First, we derive \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial w_{ij}} = \sigma( \sum_{j=1}^m w_{ij}v_j + c_i)v_j = p(H_i = 1|\bf v)v_j . Then, the gradient is become:

 \frac{\partial \log L(\bf \theta | \bf v)}{\partial w_{ij}} = p(H_i=1|\bf v)v_j - \sum_{\bf v} p(\bf v)p(H_i = 1|\bf v)v_j

We can also show that p(H_i=1| \bf v) is a sigmoid function. A similar derivation can be applied for p(V_i =1|\bf h) = \sigma(\sum_{i=1}^n w_{ij}h_i + b_j) . Due to these facts, RBM can be interpreted as a stochastic neural network where the probability of a hidden node being one is dictated by p(H_i =1 | \bf v).

Training RBM is tricky because we can’t directly evaluate the expected term. One common approximation is to use  Gibbs sampling to sample the most probable hidden and variable units from p(\bf v, \bf h) Then, we can approximate the second term. However, running Gibbs sampling requires many iterations in order to reach a stationary point, Hinton proposed a contrastive divergence where running Gibbs sampling only for k steps.

RBM is the classical stochastic neural network where it models a joint distribution between visible and hidden units. Training RBM is tricky and requires an approximation of the second log-likelihood term. The modern models such as NADE and VAE turn stochastic layers in a neural network to a deterministic system and thus training can be done through a standard backpropagation. This might be one of the reasons why NADE and VAE are more popular than RBM.


[1] Hinton, Geoffrey E., and Terrence J. Sejnowski. “Optimal perceptual inference.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE New York, 1983.

[2] Hinton, Geoffrey E. “Training products of experts by minimizing contrastive divergence.” Neural Computation 14.8 (2002): 1771-1800.

[3] Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Deep Boltzmann Machines.” AISTATS. Vol. 1. 2009.

[4] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.

[5] Fischer, Asja, and Christian Igel. “An introduction to restricted Boltzmann machines.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (2012): 14-36.


Review some Information Theory

Here are my note on some concepts from self-study on Information Theory:

The Shannon information content:

h(x) = \log_2 \frac{1}{p(x)}

This is a measurement of the information content of the event x = a. For example, when we flip a coin, an event of landing a head is 0.5, then h(x=’head’) = 1. It means that we only need one bit for this event. E.g. If we landed a tail, we know that it won’t be a head. One bit is sufficient.

Entropy: an average of Shannon information:

H(X) = -\sum_x p(x) \log_2 p(x)  = E[- \log_2 p(x) ]

We consider all events and average out all the Shannon information. Entropy can be thought as an uncertainty. The higher entropy, the more uncertainty of the event. This implies that the Entropy will be maximum when p(x) is a uniform distribution since we can’t make any prediction.

The Relative entropy (KL Divergence)

KL(P||Q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = E[\log_2 \frac{p(x)}{q(x)}]

KL(P||Q) is always non-negative (Check Gibbs’ inequality). This term will show up often in many machine learning models. It measures how much distribution q(x) diverges from p(x).

The Conditional entropy of X given Y:

This is an average uncertainty that remains about x when y is known. If I know y, then how much I still don’t know about x.

H(X|Y=b_k) = - \sum p(x|y=b_k) \log_2 p(x|y=b_k)

If we average over y:

$latext H(X|Y) = \sum_y p(y) [-\sum_x p(x|y) \log_2 p(x|y) ] = -sum_{x,y} p(x,y) \log p(x|y) $

Chain rule: 

h(x,y) = h(x) + h(y|x)

Information content of x and y is the information content of x plus information coent of y given x.

Entropy Chain rule:

H(X,Y) = H(x) + H(Y|X)

An uncertainty about X and Y is the uncertainty of X plus uncertainty of Y given X.

Mutual Information:

This term appears often in information retrieval and some machine learning models. This term measures the average reduction in uncertainty about x that results from learning the value of y. E.g. How much I will learn about x once I know about y ? What is an average amount of information that y conveys about x?

I(X; Y) = H(X) - H(X|Y)

I(X; Y) = I(Y; X) \ge 0

When we add more data, we always decrease uncertainty.

H(X,Y) \le H(X) + H(Y)

H(Y|X) = H(X,Y) - H(X) \le H(Y) + H(X) - H(X) \le H(Y)


[1] MacKay, David JC. Information theory, inference and learning algorithms. Cambridge university press, 2003.

Entropy in Variational Inference

In Variational Inference, the evidence lowerbound (ELBO) is defined as:

\log p(x) = \log \int_z p(x, z) dz = \log E_{q(z)}[\frac{p(x,z)}{q(z)}]
>= E_{q(z)}[\log p(x, z) ]- E_{q(z)}[ \log q(z) ]

At this point, I usually pay attention to the first term, the expected log-likelihood and treat the second term, entropy, as extra baggage. It makes sense that the entropy term is less relevant especially in mean-field approximation because when the complicated approximate distribution is now just a factored of much simpler approximate distributions, q(z) = \prod_i q(z_i) , the entropy term is just a sum of entropy for each hidden variables.

The entropy term shows up again when I worked on the Variational Autoencoder formulation. The entropy term is merge with a prior and result in a relative entropy:

\log p(x) >= E_{q(z)}[\log p(x| z) ] + E_{q(z)}[\log p(z)]- E_{q(z)}[ \log q(z) ]
= E_{q(z)}[\log p(x| z) ] - D_{kl}(q(z)||p(z))

When I pay closer attention, the relative entropy represents a gap (slack) between the lowerbound and the log-likelihood. The high relative entropy means that I badly approximate the true posterior.

When I studied information theory more thoroughly, the entropy can be viewed as the uncertainty of the event of interest. The more uncertainty, the higher entropy because we need to add extra bits to encode these unknown. Then, the entropy in the ELBO could imply how useful the variational distribution. If q(z) is a uniform distribution, then it is useless because we cannot make any good prediction. Thus, we want q(z) to be less flat which automatically means more predictable and lower entropy.

I think I need to be able to look from information theory perspective to get a better sense of how things work in Variational Inference. This knowledge will definitely be helpful for my future research.