Factorization Machine (ICDM’10)

I stumbled upon this paper, which focuses on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that maps a feature vector to a response value.

\hat{y} = w_0 + \mathbf{w}^T \mathbf{x}

This linear equation assumes that the samples x^{(1)}, x^{(2)}, \cdots, x^{(n)} are independent. That assumption may not hold for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, an SVM cannot find a reliable hyperplane under heavy data sparsity, so we need a model that remains reliable in this setting. Factorization Machine (FM) is also a general predictor, and [1] shows that many prediction problems are equivalent to FM.
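
As a toy illustration of this sparsity (my own hypothetical example, in the spirit of the paper's rating-prediction setting), each row of the design matrix concatenates a one-hot user indicator with a one-hot item indicator, so only two features are non-zero in any sample:

import numpy as np

# Hypothetical rating-prediction dataset: 3 users, 4 items.
# Each sample one-hot encodes (user, item), so only 2 of the
# 7 features are ever non-zero -- the sparsity FM is built for.
n_users, n_items = 3, 4

def encode(user, item):
    x = np.zeros(n_users + n_items)
    x[user] = 1.0
    x[n_users + item] = 1.0
    return x

X = np.stack([encode(0, 1), encode(0, 3), encode(2, 1)])
y = np.array([5.0, 3.0, 1.0])  # observed ratings (made up)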

FM models all pairwise feature interactions in the dataset, so each sample effectively provides observations for many interaction terms. With pair-wise interactions there are O(n^2) of them. The FM model equation is:

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T \mathbf{v}_j x_i x_j

The first two terms are the standard linear regression formulation. The third term captures interactions: the dot product of the feature embedding vectors \mathbf{v}_i and \mathbf{v}_j serves as the weight of the interaction between feature i and feature j. This formulation implies that the features are no longer treated as independent.
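
A direct transcription of the model equation into numpy might look like the sketch below (names and signature are my own). It loops over every feature pair, so it costs O(k \cdot n^2); a later paragraph shows why this quadratic cost is avoidable.

import numpy as np

def fm_predict_naive(x, w0, w, V):
    # x: (n,) feature vector; w0: bias; w: (n,) linear weights;
    # V: (n, k) matrix whose rows are the feature embeddings v_i.
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]  # <v_i, v_j> x_i x_j
    return y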

The author shows that a 2-way FM is closely related to an SVM with the polynomial kernel K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle. The key difference is that in the SVM the interaction weight matrix W is dense, with every w_{i,j} a free parameter, whereas under the FM framework it is low-rank: W = VV^T with W \in \mathbb{R}^{n \times n} and V \in \mathbb{R}^{n \times k}. This leads to fewer parameters to estimate: O(n^2) versus O(nk). If k \ll n, the O(nk) term is almost linear in n.
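
To put rough numbers on these counts (my own illustrative arithmetic, not from the paper):

# Illustrative parameter counts for the pairwise interaction weights.
n, k = 100_000, 10
dense_pairs = n * (n - 1) // 2   # free w_ij in a dense symmetric W
low_rank = n * k                 # entries of V in W = V V^T
print(f"{dense_pairs:.2e} vs {low_rank:.2e}")  # ~5.00e+09 vs 1.00e+06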

Some people have pointed out that evaluating FM costs O(k \cdot n^2), which is not the case. As shown in [1], the pairwise term can be reformulated as

\sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T \mathbf{v}_j x_i x_j = \frac{1}{2} \sum_{f=1}^k \left( \left( \sum_{i=1}^n v_{i,f} x_i \right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2 \right),

which can be evaluated in O(k \cdot n). Moreover, when the input is sparse, only the non-zero entries of \mathbf{x} contribute to the inner sums: if m(n) is the number of non-zero entries, the interaction term costs only O(k \cdot m(n)). As long as m(n) \ll n, FM's computation is effectively linear.
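
Here is a minimal numpy sketch of that linear-time evaluation (my own code, following the reformulation above); it should agree with the naive pairwise loop from earlier:

import numpy as np

def fm_predict_linear(x, w0, w, V):
    # Same inputs as fm_predict_naive, but O(k * n) instead of O(k * n^2).
    s = V.T @ x                  # per-factor sums: sum_i v_{i,f} x_i
    s2 = (V ** 2).T @ (x ** 2)   # per-factor sums of squares
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

For a sparse \mathbf{x}, both matrix-vector products only need to touch the non-zero entries, which is exactly where the O(k \cdot m(n)) bound comes from.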

Nevertheless, the input features for FM still need some feature engineering. The choice of features is important, and feature normalization is necessary.

References:

[1] Rendle, Steffen. “Factorization Machines.” In Proceedings of the 2010 IEEE 10th International Conference on Data Mining (ICDM), IEEE, 2010.
