Vertical and Horizontal Feature Hierarchy (AAAI’17)

This paper uses feature hierarchy to improve recommendation accuracy. This paper aims to learn the horizontal relationship in the feature hierarchy which has not yet been exploited.

The key assumption is that items under the same category are complementary but items under the same “parent category” are alternative. For this assumption to work, the feature hierarchy must be categorized according to complementary and alternative properties.

Item Co-occurrence

The author assumes that when two products are more similar if they are rated by the same user. They define item co-occurrence as:

IC(i, j) = \frac{P(e_i \cap e_j)}{P(e_i)P(e_j)}

This formula is similar to the point-wise mutual information (PMI).

Feature influence

The influence of feature f^l to the item i is the average IC of item i and item j in the feature category f^l.

\frac{1}{|f^l|-1}\sum_{j \in f^l, i \ne j}IC(i, j)

If item i has co-occured with many items in f_l, the feature influence on i will be high and vice versa.

Item Relationship

The following definition is the key assumption of this work:


Item i and j are alternate if P(e_i|e_j) < P(e_i) and P(e_j|e_i) < P(e_j). It said that if a user consumes item j, he/she will likely not consume item i. The opposite has to be true as well.


If we know that a user consumes item j, he will likely consume item i too, then these two items are complementary.

Horizontal Direction

This work represents a feature as a bag of items. To measure the relationship between two features, the author calculates the avearge IC between all item pairs in both features.

FR(f, g) = \frac{1}{|f|x|g|}\sum_{i \in f} \sum_{j \in g} IC(i, j)

Modeling Vertical Direction

The model learns a latent vector of all features, \theta_{f^l} and then for each item i, the additional item latent vector is a linear combination between feature influence and feature latent vectors:

\tilde \theta_i = \sum_l^L \vartheta_{f^l}\theta_{f^l}

When item i and j are in the same feature categories, the item latent vector will be more similar. The feature influence acts as a weight.

Modeling Horizontal Direction

The horizontal relationship is enforced via a regularization. If two features are related, then their latent vectors must be closer in the latent space.

\Phi(\theta_f, \theta_g) = \log FR(f,g) ||\theta_f - \theta_g||^2


The co-occurrence signal is used to determine if two items are complementary or alternative. But this work does not really care about modeling complementary or alternative. They simply use the co-occurrence information to place similar products closer in the latent space.

They represent a feature as a bag of items and then they learn a feature latent vector and perform. a linear combination of feature latent vector as an alternate item representation.

They enforce the feature latent vector to be more similar if these two features are more similar.

This work assumes that item rated by the same user are more similar. If the user purchases a running shoe and bed sheet, then these two products will be more similar. Since the chance that these two products will be co-purchased are less likely, the co-occurrence information based on co-rated information proves to be useful in this work.

The model can be summarized as:

Item Latent vector = Traditional Latent vector + a weight-sum of feature vectors.


Sun, Zhu, et al. “Exploiting both Vertical and Horizontal Dimensions of Feature Hierarchy for Effective Recommendation.” AAAI. 2017.








Max function

Max function is a useful function in statistic because we often want to select the highest score from the group or query the highest value from a set of numbers.

Argmax is also as important as max function. We often see argmax in machine learning algorithm because we usually want to know which is the best item instance (with respect to the highest score) from a group of items.

But both functions are not differentiable so the gradient-based optimizer cannot handle these functions. Hence, the smooth approximation of max and argmax functions have been used so that we can employ these function in a computational graph.


We start with softmax which I think it should be named soft-argmax because this function returns a one-hot encoder of the item with the highest value.


Given a vector \textbf{x} = \{ x_1, x_2, \cdots, x_n \}, then

\text{softmax}(\textbf{x}) = \frac{1}{Z}\bigg[ \exp{(x_1)}, \exp{(x_2)}, \cdots, \exp{(x_n)} \bigg] where Z = \sum_{i=1}^N \exp{(x_i)}

The exponential function amplifies the high value to be even higher value. Together with a normalization, it pushes down other smaller numbers closer to zero. Since the sum of all entries is 1, the highest value will end up with a value closer to 1. Hence, the output vector is a sparse vector with a single entry with a value that is close to 1. This is similar to a one-hot encoder.


The max function can be approximated as follows:

\text{max}(\textbf{x}) \approx \log \sum_{i=1}^N \exp{(x_i)}

The exponential function will make the largest number become even larger so that other values are way smaller. The total sum is not different from the largest value. For example, if we have 1 billion, and a lot of one hundred, summing them up we still get the total around 1 billion.

However, the above equation causes an overflow or underflow when we put this formula on the computer. The common fix is to dampen each value by a fixed constant:

\log \sum_{i=1}^N = \log \bigg( \exp{(a)} \cdot \sum_{i=1}^N \frac{\exp{(x_i)}}{\exp{(a)}} \bigg) \\  = a + \log \bigg( \sum_{i=1}^N \exp{(x_i - a)} \bigg)

We set a to be the maximum value a = \text{max}(\textbf{x}).



Frobenius norm and Trace

I found the connection between Frobenius norm and trace operation is simple and elegant.

Frobenius norm is a square-root sum of absolute squares of every element. It defines as:

||X||_F = \sqrt{\sum_i \sum_j |a_{ij}|^2}

If all elements in the matrix is real number, then we do not need an absolute operator.

Now, we re-express matrix X such that each row is one vector:

X =  \begin{bmatrix} \vec{r_1} \\ \vec{r_2} \\ \cdots \\ \vec{r_m} \end{bmatrix}

Since the sum of square of \vec{r_i} is a dot product \vec{r_i}^T\vec{r_i}, we can compute the sum of squres of all elements by:

XX^T =  \begin{bmatrix} \vec{r_1} \\ \vec{r_2} \\ \cdots \\ \vec{r_m} \end{bmatrix}\begin{bmatrix} \vec{r_1} \\ \vec{r_2} \\ \cdots \\ \vec{r_m} \end{bmatrix}^T

Any element along the diagonal of XX^T is a sum square of element from one of the row of X

To compute the total sum squares of every element, we can sum along the diagonal:

\sum_i (XX^T)_{ii} = \text{Tr}(XX^T)

And a sum along a diagonal of matrix is a trace operation.

Why KL Divergence is always non-negative?

Here is the proof showing that the KL divergence of two distributions p and q is always non-negative.

D_{\text{KL}}(p||q) = \int_{x \in \text{supp}(p)} p(x) \log \frac{p(x)}{q(x)} dx \\  = - \int_{x \in \text{supp}(p)} p(x) \log \frac{q(x)}{p(x)} dx \\  \ge \log\int_{x \in \text{supp}(p)} p(x) \frac{q(x)}{p(x)} dx \\  \ge \log\int_{x \in \text{supp}(p)} q(x) dx \\  \ge \log 1 \\  \ge 0

The key step is to apply the Jensen’s inequality so that the logarithm will be placed outside of the integration.


VAE: Adding More Stochastic Layers gives a tighter lower-bound.

Does adding more stochastic layers to the recognition model (encoder function) give a tighter lower-bound? Daniel Jiwoong (Bengio’s student)’s AAAI paper, “Denoising Criterion for Variational Auto-Encoding Framework. (AAAI 2017)”, claims that this is true.

It has been known that multi-modal recognition models can learn a more complex posterior distribution from the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the complexity of the recognition models.

Lemma 0:

This proof shows that the following inequality is true:

E_{f(x)}[\log f(x)] \ge E_{f(x)}[\log g(x)]

By using the KL divergence property, which is defined as:

D_{KL}(f || g) = \int_x f(x) \log \frac{f(x)}{g(x)} dx = E_{f(x)}[ \log \frac{f(x)}{g(x)} ] \ge 0

Since KL divergence is always non-negative (you can also prove that too.), arranging the inequality will result in the following expression:

E_{f(x)}[ \log f(x)] - E_{f(x)}[ \log g(x)] \ge 0


E_{f(x)}[ \log f(x)] \ge E_{f(x)}[ \log g(x)]

This statement says that the cross entropy for f and g is always at least the entropy of $f$. This makes sense because when we set distribution $g$ to be the same as $f$, we will get the lowest cross entropy.

Lemma 1:

A feedforward network with multiple stochastic layers can be defined as a marginal distribution of multiple latent variables:

q(\textbf{z}|\textbf{x}) = \int_{\textbf{h}}q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{x})d\textbf{h} = E_{q(\textbf{h}|\textbf{x})}[q(\textbf{z}|\textbf{h})]


\log p(\textbf{x}) \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

We will prove that left and right inequality are satisfied.

Right inequality

We start with the right inequality by simplifying the expression:

E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]- E_{q(\textbf{z}|\textbf{x})}[q(\textbf{z}|\textbf{h})] \ge E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]-E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})]

From Lemma 0: if we replace f(x) with q(\textbf{z}|\textbf{x}) and g(x) with q(\textbf{z}|\textbf{h}), we will end up with the following inequality:

E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})] \ge E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{h})]

This shows that the right inequality is satisfied.

Left inequality

We expand the encoder function q(\textbf{z}|\textbf{x}) as:

E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(z|\textbf{x})}] = \int_\textbf{z} q(\textbf{z}|\textbf{x})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} \\ = \int_\textbf{z} \int_h q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{z})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} d\textbf{h} \\ = E_{q(\textbf{z}|\textbf{h})}E_{q(\textbf{h}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

According to the Jensen’s inequality:

E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \le \log E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \\ = \log E_{q(\textbf{h}|\textbf{x})}\int_{\textbf{z}} q(\textbf{z}|\textbf{h})[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}]d\textbf{z} \\ = \log E_{q(\textbf{h}|\textbf{x})}[p(\textbf{x})] = \log p(\textbf{x})

The left inequality is also satisfied.


I went over the proof presented in the paper, “Denoising Criterion for Variational Auto-Encoding Framework”. The simple proof on adding one extra stochastic layer shows that we get a tighter lowerbound. The original paper also generalizes its claim to L stochastic layers. By following the same proof strategy, they show that the lowerbound will be tighter as we add more stochastic layers.


[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.

[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.

[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” arXiv preprint arXiv:1605.08803 (2016).


NNets Methods for NLP – CH8: From Textual Features to Inputs

I highlighted the key concepts from chapter 8 of “Neural Network Methods for NLP” by Yoav Goldberg.

One hot encoding: a sparse feature vector, assign a unique dimension for each possible feature. Each row of W corresponds to a particular feature.

Dense encoding: a dimension is smaller than the number of features. The matrix W is much smaller. It has a better generalization. Can use the pre-trained embedding from a larger text corpus.

Windows-based features: represent a local structure around the focus word.

  • Concat all surrounding words if we care the word position.
  • Sum or average word vectors if we do not care the word position.
  • Use weight sum if we somewhat care the word position.

CBOW – an average of word vectors.

Padding: add a special symbol to the vocabulary e.g. beginning or ending indicators.

Unknown word: a special token represents an unknown word.

Word signature: a fine-grained strategy to deal with unknown words. E.g. any rare word ending with ‘ing’ is replaced with *__ING*; any number is replaced with a *NUM* token.

Word Dropout

  • Replace some infrequent features (words) with an unknown token. But we lose some information.
  • Randomly replace a word with an unknown token. This replacement is based on word frequency. One possible formula is \frac{\alpha}{c(w) + \alpha} where \alpha is dropout aggressiveness.

Word dropout as regularization

Apply word dropout to all words, ignoring the word frequency. Use Bernoulli trial.


Chapter 8, “Neural Network Methods for NLP”, 2nd edition, Yoav Goldberg.


Node2Vec: Scalable feature learning for networks (KDD’16)


This paper proposes a generalization of DeepWalk [2] and Line [3].

Key Ideas

The key observation is that the role of the vertex is also important. For example, given two vertices that are far apart from each other but share similar kind of vertices. Then, they both have similar roles (e.g. hubs or bridges). Breadth-First-Search traversal can capture this graph structure. On the other hands, the community is described as reachability/closeness of the two nodes in the graph. For instance, in the social network, my friend’s friend’s friend has a higher chance to belong to the same community as me.


Similar Works

DeepWalk uses a random walk to create a set of vertices that represents a context around the vertex of interest. The path generated by the random walk will either lead to the very faraway node or nodes around the seed. The context that is captured by this process can be unpredictable and depends on the graph structure.

Line focuses on neighbor vertices, which is the same as Breadth-First-Search traversal (BFS). In this case, it captures the local community in the graph.


It comes down to what is a proper way to define the walk so that we can capture both community structure and role-based structure. Node2Vec defines a general method of graph traversal that is controlled by 2 parameters, p and q.

The key difference between BFS and DFS samplings is that the BFS is better at exploring the local neighborhoods while DFS is good at exploring larger parts of the network. BFS is viewed as micro-view of the graph whereas DFS characterizes the macro-view of the graph. The authors believe that the mixture of these two classic sampling will improve the representation of the graph embedding.

Search bias \alpha


The unnormalized transitional probability from node v to x is:

\alpha_{p,q}(t, x) = \frac{1}{p} if d_{tx} = 0

\alpha_{p,q}(t, x) = 1 if d_{tx} = 1

\alpha_{p,q}(t, x) = \frac{1}{q} if d_{tx} = 2

Where t is the previously visited node, v is the current node, and x is the next node. The distance d_{tx} determines the type of visiting nodes. When the distance between the previous node t and the next node x is zero, it means that we return to the node t.

However, when the distance is 1, it means that we want to visit the node that is directly connected to current node v and the previous node t. It means that node x is shared node between v and t. Hence, this walk will capture the structure of the network (local view).

Lastly, when the distance is 2, we want to hop further away from node t. This is similar to DFS where we want to go deeper into the network graph.

Parameter p and q will control the characteristic of bias walking. The high value of p means that we don’t want to go back often. The high value of q means that we do not want to make too many hops. Hence, p and q control the balance between BFS and DFS sampling.


This paper generalizes the random walk by adding parameters to control the walk characteristic. I think this is a neat idea because some information networks may need a specific walk than a general random walk. Thus, this model allows us to define the context based on how much we want to explore the network versus how much we want to exploit the local structure.


[1] Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016.

[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[3] Tang, Jian, et al. “Line: Large-scale information network embedding.” Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.