# Restricted Boltzmann Machine (RBM)

The classical work in a generative model that has a connection to an artificial neural network is the Boltzmann Machine [1] and Restricted Boltzmann Machine (RBM) [2]. The deep version of RBM [3] shows many successes in pretraining the parameters of the neural networks. There are many good tutorials on RBM and source code that we can find online, my goal for this post is to point out some important aspects of RBM as my review material before posting about more complicated models such as NADE [4].

My favorite tutorial on RBM is by Fischer and Christian [5]. It covers many key ideas for modeling and training RBM. My post is based on this tutorial paper.

What I understand about the RBM is that we are modeling a joint probability of visible and hidden units, $p(\bf v, \bf h)$. RBM removes all connections between hidden units, and result in an efficient learning because all hidden units are conditional independence given visible units.

A graphical model of RBM. Blue nodes are visible units and Yellow nodes are hidden units. We want to infer hidden units from visible units.

One way to define the joint probability is to use Gibbs distribution and it defines $p(\bf v, \bf h) = \frac{1}{Z} e^{-E(\bf v, \bf h)}$, where Z is a normalization constant or a partition function, $Z = \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}$. It important to realize that computing the normalization constant is intractable because we have to enumerate all possible $\bf v$ and $\bf h$.

The gradient of log-likelihood is $\log L(\bf \theta | \bf v) = \log p(\bf v; \bf \theta) = \log \frac{1}{Z} \sum_{\bf h} e^{-E(\bf v, \bf h)} =\log \sum_{\bf h} e^{-E(\bf v, \bf h)} -\log \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}$. Once we take a derivative w.r.t. to the model parameter, $\bf \theta$, the gradient is:

$\frac{\partial \log L(\bf \theta | \bf v)}{\partial \bf \theta} = - \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta} + \sum_{\bf v, \bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \theta}$ This expression shows that the gradient is the difference between an expected energy under the model distribution and under the posterial distribution. Directly evaluating these summation is intractable.

RBM defines an energy function as $E(\bf v, \bf h) = -\bf h^T W \bf v - \bf b^T \bf v - \bf c^T \bf h$ We can derive the gradient w.r.t. to $w_{ij}$. First, we derive $\sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial w_{ij}} = \sigma( \sum_{j=1}^m w_{ij}v_j + c_i)v_j = p(H_i = 1|\bf v)v_j$. Then, the gradient is become:

$\frac{\partial \log L(\bf \theta | \bf v)}{\partial w_{ij}} = p(H_i=1|\bf v)v_j - \sum_{\bf v} p(\bf v)p(H_i = 1|\bf v)v_j$

We can also show that $p(H_i=1| \bf v)$ is a sigmoid function. A similar derivation can be applied for $p(V_i =1|\bf h) = \sigma(\sum_{i=1}^n w_{ij}h_i + b_j)$. Due to these facts, RBM can be interpreted as a stochastic neural network where the probability of a hidden node being one is dictated by $p(H_i =1 | \bf v)$.

Training RBM is tricky because we can’t directly evaluate the expectation term. One common approximation is to use  Gibbs sampling to sample the most probable hidden and variable units from $p(\bf v, \bf h)$ Then, we can approximate the second term. However, running Gibbs sampling requires many iterations in order to reach a stationary point, Hinton proposed a contrastive divergence where running Gibbs sampling only for k steps.

RBM is the classical stochastic neutral network where it models a joint distribution between visible and hidden units. Training RBM is tricky and requires an approximation of the second log-likelihood term. The modern models such as NADE and VAE turn stochastic layers in a neural network to a deterministic system and thus training can be done through a standard backpropagation. This might be one of the reasons why NADE and VAE are more popular than RBM.

References:

[1] Hinton, Geoffrey E., and Terrence J. Sejnowski. “Optimal perceptual inference.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. IEEE New York, 1983.

[2] Hinton, Geoffrey E. “Training products of experts by minimizing contrastive divergence.” Neural computation 14.8 (2002): 1771-1800.

[3] Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Deep Boltzmann Machines.” AISTATS. Vol. 1. 2009.

[4] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.

[5] Fischer, Asja, and Christian Igel. “An introduction to restricted Boltzmann machines.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (2012): 14-36.