Max function

The max function is useful in statistics because we often want to select the highest score from a group or query the largest value in a set of numbers.

Argmax is just as important as max. We often see argmax in machine learning algorithms because we usually want to know which item in a group is the best instance (the one with the highest score).

But neither function is differentiable, so a gradient-based optimizer cannot handle them. Hence, smooth approximations of max and argmax are used so that we can employ these functions in a computational graph.


We start with softmax, which I think should be named soft-argmax, because this function returns an approximately one-hot encoding of the item with the highest value.


Given a vector \textbf{x} = \{ x_1, x_2, \cdots, x_n \}, then

\text{softmax}(\textbf{x}) = \frac{1}{Z}\bigg[ \exp{(x_1)}, \exp{(x_2)}, \cdots, \exp{(x_n)} \bigg] where Z = \sum_{i=1}^n \exp{(x_i)}

The exponential function amplifies the largest value even further. Together with the normalization, it pushes the smaller entries closer to zero. Since all entries sum to 1, the largest value ends up close to 1. Hence, the output vector is nearly sparse, with a single entry close to 1, similar to a one-hot encoding.
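A quick numerical sketch of this behavior (numpy; the function name and the max-subtraction stabilization step are my own choices):

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([1.0, 2.0, 8.0])
print(softmax(scores))  # the entry for the largest score is close to 1
```

Running this, the output sums to 1 and the third entry exceeds 0.99, i.e., nearly a one-hot vector.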


The max function can be approximated as follows:

\text{max}(\textbf{x}) \approx \log \sum_{i=1}^n \exp{(x_i)}

The exponential function makes the largest number so much larger that the other values barely matter: the total sum is not much different from the largest term. For example, if we have 1 billion and a handful of hundreds, the sum is still about 1 billion, so its log is close to the log of the largest term.

However, the above equation overflows or underflows when we evaluate the formula on a computer. The common fix is to shift every value by a constant:

\log \sum_{i=1}^n \exp{(x_i)} = \log \bigg( \exp{(a)} \cdot \sum_{i=1}^n \exp{(x_i - a)} \bigg) \\  = a + \log \bigg( \sum_{i=1}^n \exp{(x_i - a)} \bigg)

We set a to be the maximum value, a = \text{max}(\textbf{x}), so the largest exponent becomes zero and no term can overflow.
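This shifted form is the standard log-sum-exp trick; a minimal numpy sketch (function name my own):

```python
import numpy as np

def logsumexp(x):
    a = np.max(x)  # shift by the max so the largest exponent is zero
    return a + np.log(np.sum(np.exp(x - a)))

x = np.array([1000.0, 1000.5, 999.0])
print(logsumexp(x))  # finite, while the naive np.log(np.sum(np.exp(x))) overflows to inf
```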




Frobenius norm and Trace

I find the connection between the Frobenius norm and the trace operation simple and elegant.

The Frobenius norm is the square root of the sum of the absolute squares of every element. It is defined as:

||X||_F = \sqrt{\sum_i \sum_j |x_{ij}|^2}

If all elements of the matrix are real numbers, then we do not need the absolute value.

Now, we re-express matrix X such that each row is one vector:

X =  \begin{bmatrix} \vec{r_1} \\ \vec{r_2} \\ \vdots \\ \vec{r_m} \end{bmatrix}

Since the sum of squares of the elements of \vec{r_i} is the dot product \vec{r_i}^T\vec{r_i}, we can compute the sum of squares of all elements via:

XX^T =  \begin{bmatrix} \vec{r_1} \\ \vec{r_2} \\ \vdots \\ \vec{r_m} \end{bmatrix}\begin{bmatrix} \vec{r_1} \\ \vec{r_2} \\ \vdots \\ \vec{r_m} \end{bmatrix}^T

Each element along the diagonal of XX^T is the sum of squares of the elements in one row of X.

To compute the total sum of squares of every element, we sum along the diagonal:

||X||_F^2 = \sum_i (XX^T)_{ii} = \text{Tr}(XX^T)

And a sum along the diagonal of a matrix is exactly the trace operation.
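The identity ||X||_F^2 = \text{Tr}(XX^T) is easy to check numerically (a small numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))

fro = np.sqrt((X ** 2).sum())            # definition: sqrt of sum of squares
via_trace = np.sqrt(np.trace(X @ X.T))   # the trace identity

assert np.isclose(fro, via_trace)
assert np.isclose(fro, np.linalg.norm(X, 'fro'))  # matches numpy's built-in
```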

Why KL Divergence is always non-negative?

Here is a proof showing that the KL divergence between two distributions p and q is always non-negative.

D_{\text{KL}}(p||q) = \int_{x \in \text{supp}(p)} p(x) \log \frac{p(x)}{q(x)} dx \\  = - \int_{x \in \text{supp}(p)} p(x) \log \frac{q(x)}{p(x)} dx \\  \ge -\log\int_{x \in \text{supp}(p)} p(x) \frac{q(x)}{p(x)} dx \\  = -\log\int_{x \in \text{supp}(p)} q(x) dx \\  \ge -\log 1 \\  = 0

The key step is to apply Jensen’s inequality so that the logarithm moves outside the integral: since \log is concave, E[\log Y] \le \log E[Y], and negating both sides gives the inequality above.
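As a quick sanity check of this result, we can compute the discrete KL divergence for random distributions (numpy sketch; function name my own):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    # Discrete KL divergence; assumes p and q are strictly positive.
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(10))    # random points on the simplex
    q = rng.dirichlet(np.ones(10))
    assert kl(p, q) >= 0              # non-negative for every pair
    assert np.isclose(kl(p, p), 0.0)  # and zero when the distributions match
```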


VAE: Adding More Stochastic Layers Gives a Tighter Lower Bound

Does adding more stochastic layers to the recognition model (the encoder) give a tighter lower bound? The AAAI paper by Daniel Jiwoong Im (a student of Bengio) et al., “Denoising Criterion for Variational Auto-Encoding Framework” (AAAI 2017), claims that this is true.

It is known that multi-modal recognition models can learn a more complex posterior distribution of the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the expressiveness of the recognition model.

Lemma 0:

This lemma shows that the following inequality holds:

E_{f(x)}[\log f(x)] \ge E_{f(x)}[\log g(x)]

We use the KL divergence, which is defined as:

D_{KL}(f || g) = \int_x f(x) \log \frac{f(x)}{g(x)} dx = E_{f(x)}[ \log \frac{f(x)}{g(x)} ] \ge 0

Since the KL divergence is always non-negative (proved in the previous section), rearranging the inequality results in the following expression:

E_{f(x)}[ \log f(x)] - E_{f(x)}[ \log g(x)] \ge 0


E_{f(x)}[ \log f(x)] \ge E_{f(x)}[ \log g(x)]

This statement says that the cross entropy between f and g is always at least the entropy of f. This makes sense because when we set the distribution g to be the same as f, we get the lowest possible cross entropy.

Lemma 1:

A recognition model with multiple stochastic layers can be written as a marginal distribution over an intermediate latent variable:

q(\textbf{z}|\textbf{x}) = \int_{\textbf{h}}q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{x})d\textbf{h} = E_{q(\textbf{h}|\textbf{x})}[q(\textbf{z}|\textbf{h})]


\log p(\textbf{x}) \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

We will prove that the left and right inequalities both hold.

Right inequality

We start with the right inequality by simplifying the expression:

E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]- E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{h})] \ge E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]-E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})]

From Lemma 0: if we replace f(x) with q(\textbf{z}|\textbf{x}) and g(x) with q(\textbf{z}|\textbf{h}), we will end up with the following inequality:

E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})] \ge E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{h})]

This shows that the right inequality is satisfied.

Left inequality

We expand the encoder distribution q(\textbf{z}|\textbf{x}) as:

E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}] = \int_\textbf{z} q(\textbf{z}|\textbf{x})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} \\ = \int_\textbf{z} \int_\textbf{h} q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{x})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{h} d\textbf{z} \\ = E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

According to Jensen’s inequality:

E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \le \log E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \\ = \log E_{q(\textbf{h}|\textbf{x})}\int_{\textbf{z}} q(\textbf{z}|\textbf{h})[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}]d\textbf{z} \\ = \log E_{q(\textbf{h}|\textbf{x})}[p(\textbf{x})] = \log p(\textbf{x})

The left inequality is also satisfied.


I went over the proof presented in the paper, “Denoising Criterion for Variational Auto-Encoding Framework”. This simple proof for adding one extra stochastic layer shows that we get a tighter lower bound. The original paper also generalizes its claim to L stochastic layers: following the same proof strategy, the authors show that the lower bound becomes tighter as we add more stochastic layers.


[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.

[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.

[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” arXiv preprint arXiv:1605.08803 (2016).


NNets Methods for NLP – CH8: From Textual Features to Inputs

I highlighted the key concepts from chapter 8 of “Neural Network Methods for NLP” by Yoav Goldberg.

One-hot encoding: a sparse feature vector that assigns a unique dimension to each possible feature. Each row of W corresponds to a particular feature.

Dense encoding: the dimension is smaller than the number of features, so the matrix W is much smaller. It generalizes better, and we can use pre-trained embeddings from a larger text corpus.

Window-based features: represent the local structure around the focus word.

  • Concatenate all surrounding words if we care about word position.
  • Sum or average word vectors if we do not care about word position.
  • Use a weighted sum if we somewhat care about word position.

CBOW – an average of word vectors.

Padding: add special symbols to the vocabulary, e.g., beginning and ending indicators.

Unknown word: a special token that represents an unknown word.

Word signature: a fine-grained strategy for dealing with unknown words, e.g., any rare word ending in ‘ing’ is replaced with *__ING*; any number is replaced with a *NUM* token.

Word Dropout

  • Replace some infrequent features (words) with an unknown token, though we lose some information.
  • Randomly replace a word with an unknown token, based on word frequency. One possible replacement probability is \frac{\alpha}{c(w) + \alpha}, where c(w) is the word count and \alpha controls the dropout aggressiveness.

Word dropout as regularization

Apply word dropout to all words, ignoring word frequency, using a Bernoulli trial for each word.
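The frequency-based dropout above can be sketched in a few lines (a toy Python sketch; the function name and the <UNK> symbol are my own choices):

```python
import random
from collections import Counter

def word_dropout(tokens, counts, alpha=0.25, unk="<UNK>", rng=random):
    # Each word w is replaced with the unknown token with probability
    # alpha / (c(w) + alpha), so rare words are dropped more aggressively.
    out = []
    for w in tokens:
        p = alpha / (counts[w] + alpha)
        out.append(unk if rng.random() < p else w)
    return out

corpus = "the cat sat on the mat the end".split()
counts = Counter(corpus)
print(word_dropout(corpus, counts, alpha=0.25))
```

With alpha = 0.25, a word seen 3 times (‘the’) is dropped with probability 0.25/3.25 ≈ 0.08, while a word seen once is dropped with probability 0.2.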


Chapter 8, “Neural Network Methods for NLP”, 2nd edition, Yoav Goldberg.


Restricted Boltzmann Machine (RBM)

The classical work on generative models with a connection to artificial neural networks is the Boltzmann Machine [1] and the Restricted Boltzmann Machine (RBM) [2]. The deep version of RBM [3] showed many successes in pretraining the parameters of neural networks. There are many good tutorials on RBM and source code available online; my goal for this post is to point out some important aspects of RBM as review material before posting about more complicated models such as NADE [4].

My favorite tutorial on RBM is by Fischer and Igel [5]. It covers many key ideas for modeling and training RBM. My post is based on this tutorial paper.

What I understand about the RBM is that we model a joint probability of visible and hidden units, p(\bf v, \bf h). RBM removes all connections between hidden units, which results in efficient learning because all hidden units are conditionally independent given the visible units.


A graphical model of RBM: blue nodes are visible units and yellow nodes are hidden units. We want to infer the hidden units from the visible units.

One way to define the joint probability is the Gibbs distribution, p(\bf v, \bf h) = \frac{1}{Z} e^{-E(\bf v, \bf h)}, where Z is a normalization constant, or partition function, Z = \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. It is important to realize that computing the normalization constant is intractable because we would have to enumerate all possible \bf v and \bf h.

The log-likelihood is \log L(\bf \theta | \bf v) =  \log p(\bf v; \bf \theta) = \log \frac{1}{Z} \sum_{\bf h} e^{-E(\bf v, \bf h)} =\log \sum_{\bf h} e^{-E(\bf v, \bf h)}  -\log \sum_{\bf v, \bf h} e^{-E(\bf v, \bf h)}. Once we take the derivative w.r.t. the model parameters \bf \theta, the gradient is:

\frac{\partial \log L(\bf \theta | \bf v)}{\partial \bf \theta} = - \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial \bf \theta} + \sum_{\bf v, \bf h} p(\bf v, \bf h) \frac{\partial E(\bf v, \bf h)}{\partial \bf \theta} This expression shows that the gradient is the difference between the expected energy gradient under the posterior distribution p(\bf h | \bf v) and under the model distribution p(\bf v, \bf h). Directly evaluating the second summation is intractable.

RBM defines the energy function as E(\bf v, \bf h) = -\bf h^T W \bf v - \bf b^T \bf v - \bf c^T \bf h. We can derive the gradient w.r.t. w_{ij}. First, we derive \sum_{\bf h} p(\bf h | \bf v) \frac{\partial E(\bf v, \bf h)}{\partial w_{ij}} = -\sigma( \sum_{j=1}^m w_{ij}v_j + c_i)v_j = -p(H_i = 1|\bf v)v_j. Then, the gradient becomes:

 \frac{\partial \log L(\bf \theta | \bf v)}{\partial w_{ij}} = p(H_i=1|\bf v)v_j - \sum_{\bf v} p(\bf v)p(H_i = 1|\bf v)v_j

We can also show that p(H_i=1| \bf v) is a sigmoid function. A similar derivation applies for p(V_j =1|\bf h) = \sigma(\sum_{i=1}^n w_{ij}h_i + b_j). Due to these facts, RBM can be interpreted as a stochastic neural network where the probability of a hidden unit being one is given by p(H_i =1 | \bf v).

Training RBM is tricky because we can’t directly evaluate the expectation under the model distribution. One common approximation is to use Gibbs sampling to draw hidden and visible units from p(\bf v, \bf h); then we can approximate the second term. However, Gibbs sampling requires many iterations to reach the stationary distribution, so Hinton proposed contrastive divergence, which runs Gibbs sampling for only k steps.
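A CD-1 update (one Gibbs step) can be sketched as follows, assuming binary units and the energy function above; this is my own toy numpy sketch, not reference code from any paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, lr=0.1):
    # Positive phase: p(H=1|v) is a sigmoid, as derived above.
    ph0 = sigmoid(v0 @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step v0 -> h0 -> v1 -> p(h1|v1).
    pv1 = sigmoid(h0 @ W + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W.T + c)
    # Approximate gradient: data statistics minus (k=1) model statistics.
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c

n_visible, n_hidden = 6, 3
W = 0.01 * rng.normal(size=(n_hidden, n_visible))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v = rng.integers(0, 2, size=n_visible).astype(float)
W, b, c = cd1_step(v, W, b, c)
```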

RBM is a classical stochastic neural network that models a joint distribution between visible and hidden units. Training RBM is tricky and requires an approximation of the second log-likelihood term. Modern models such as NADE and VAE turn the stochastic layers in a neural network into a deterministic system, so training can be done through standard backpropagation. This might be one of the reasons why NADE and VAE are more popular than RBM.


[1] Hinton, Geoffrey E., and Terrence J. Sejnowski. “Optimal perceptual inference.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE New York, 1983.

[2] Hinton, Geoffrey E. “Training products of experts by minimizing contrastive divergence.” Neural Computation 14.8 (2002): 1771-1800.

[3] Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Deep Boltzmann Machines.” AISTATS. Vol. 1. 2009.

[4] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.

[5] Fischer, Asja, and Christian Igel. “An introduction to restricted Boltzmann machines.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (2012): 14-36.

Review some Information Theory

Here are my notes on some concepts from self-study of Information Theory:

The Shannon information content:

h(x) = \log_2 \frac{1}{p(x)}

This is a measurement of the information content of the event x = a. For example, when we flip a fair coin, the probability of landing heads is 0.5, so h(x=’head’) = 1. It means that we need only one bit for this event: if we did not land heads, we know it was tails, so one bit is sufficient.

Entropy: an average of Shannon information:

H(X) = -\sum_x p(x) \log_2 p(x)  = E[- \log_2 p(x) ]

We consider all events and average their Shannon information. Entropy can be thought of as uncertainty: the higher the entropy, the more uncertain the outcome. This implies that the entropy is maximized when p(x) is a uniform distribution, since then we can’t make any prediction.
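A toy check of these entropy values (numpy; function name my own):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # use the convention 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))                # fair coin: 1 bit
print(entropy([1.0, 0.0]))                # certain outcome: 0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2 bits, the maximum
```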

The Relative entropy (KL Divergence)

KL(P||Q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = E[\log_2 \frac{p(x)}{q(x)}]

KL(P||Q) is always non-negative (see Gibbs’ inequality). This term shows up often in many machine learning models. It measures how much the distribution q(x) diverges from p(x).

The Conditional entropy of X given Y:

This is the average uncertainty that remains about x when y is known: if I know y, how much do I still not know about x?

H(X|Y=b_k) = - \sum_x p(x|y=b_k) \log_2 p(x|y=b_k)

If we average over y:

H(X|Y) = \sum_y p(y) \bigg[-\sum_x p(x|y) \log_2 p(x|y) \bigg] = -\sum_{x,y} p(x,y) \log_2 p(x|y)

Chain rule: 

h(x,y) = h(x) + h(y|x)

The information content of x and y is the information content of x plus the information content of y given x.

Entropy Chain rule:

H(X,Y) = H(X) + H(Y|X)

An uncertainty about X and Y is the uncertainty of X plus uncertainty of Y given X.

Mutual Information:

This term appears often in information retrieval and some machine learning models. It measures the average reduction in uncertainty about x that results from learning the value of y. E.g., how much will I learn about x once I know y? What is the average amount of information that y conveys about x?

I(X; Y) = H(X) - H(X|Y)

I(X; Y) = I(Y; X) \ge 0

Conditioning on more data never increases the average uncertainty:

H(X,Y) \le H(X) + H(Y)

H(Y|X) = H(X,Y) - H(X) \le H(X) + H(Y) - H(X) = H(Y)
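These identities can be verified on a toy joint distribution (numpy sketch; the joint table here is made up for illustration):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# A made-up joint distribution p(x, y) over two binary variables.
pxy = np.array([[0.25, 0.25],
                [0.00, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals

I = H(px) + H(py) - H(pxy)       # I(X;Y) = H(X) + H(Y) - H(X,Y)
assert I >= 0                    # mutual information is non-negative
assert H(pxy) <= H(px) + H(py)   # H(X,Y) <= H(X) + H(Y)

H_y_given_x = H(pxy) - H(px)     # chain rule, rearranged
assert H_y_given_x <= H(py)      # conditioning never increases entropy
print(I)
```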


[1] MacKay, David JC. Information theory, inference and learning algorithms. Cambridge university press, 2003.