Factorization Machine (ICDM’10)

I’ve stumbled upon this paper, which focuses on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that maps a feature vector to a response value.

\hat y = w_0 + \mathbf{w}^T\mathbf{x}

This linear equation assumes that the samples x^{(1)}, x^{(2)}, \cdots, x^{(n)} are independent. This assumption may not hold for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, an SVM cannot find a reliable hyperplane under data sparsity, so we need a model that stays reliable in this setting. Factorization Machine (FM) is also a general predictor, and [1] shows that many prediction problems are equivalent to FM.
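To make the sparsity concrete, here is a toy encoding in the spirit of the paper’s collaborative-filtering example (the sizes and the `encode` helper are hypothetical, not from the paper): each rating event becomes the concatenation of a one-hot user indicator and a one-hot item indicator.

```python
import numpy as np

# Hypothetical toy setup: 3 users and 4 movies. A single rating event is
# encoded as one-hot user indicator + one-hot movie indicator.
n_users, n_movies = 3, 4

def encode(user_id, movie_id):
    x = np.zeros(n_users + n_movies)
    x[user_id] = 1.0
    x[n_users + movie_id] = 1.0
    return x

x = encode(user_id=1, movie_id=2)
# Only 2 of the 7 entries are non-zero: the sparse regime the paper targets.
```

No matter how many users and movies there are, each sample has exactly two non-zero entries, so the feature matrix is extremely sparse.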

FM models all pairwise interactions in the dataset. Because the interaction weights are factorized, observations for one feature pair also inform related pairs, which effectively increases the data available for estimating each parameter. If we model pair-wise interactions, there are O(n^2) of them. The FM model equation is:

\hat y(\mathbf{x}) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T\mathbf{v}_jx_ix_j

The first two terms are the standard linear regression formulation. The third term models pairwise interactions: the dot product of the feature embedding vectors \textbf{v}_i and \textbf{v}_j is the weight of the interaction between feature i and feature j. This formulation implies that the features are not treated as independent.
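A direct NumPy transcription of the model equation might look like this (the function name is mine, and the double loop is the naive O(k \cdot n^2) evaluation, kept deliberately literal):

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Naive FM prediction, term by term from the model equation.

    x  : (n,) feature vector
    w0 : global bias
    w  : (n,) linear weights
    V  : (n, k) feature embedding matrix (row i is v_i)
    """
    n = x.shape[0]
    y = w0 + w @ x                      # bias + linear terms
    for i in range(n):                  # pairwise interaction terms
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y
```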

The author shows that a 2-way FM is equivalent to an SVM with a polynomial kernel K(\textbf{x},\textbf{z}) = \langle\phi(\textbf{x}),\phi(\textbf{z})\rangle. The only difference is that the interaction weight matrix W in the SVM is dense, while it is low-rank under the FM framework: W = VV^T with W \in \mathbb{R}^{n \times n} and V \in \mathbb{R}^{n \times k}. This leads to fewer parameters to estimate: O(n^2) versus O(nk). If k \ll n, then O(nk) is almost linear in n.
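A quick back-of-the-envelope comparison (the sizes here are hypothetical, just to show the scale of the gap):

```python
# Hypothetical sizes: n = 100_000 features with k = 10 latent factors.
n, k = 100_000, 10
dense_pairwise = n * (n - 1) // 2   # one independent w_{ij} per feature pair
low_rank = n * k                    # entries of the factor matrix V
print(dense_pairwise)  # 4_999_950_000
print(low_rank)        # 1_000_000
```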

Some people have pointed out that the computation of FM is O(n^2), which is not the case. The interaction term \sum_{i=1}^n \sum_{j=i+1}^n \textbf{v}_i^T \textbf{v}_j x_i x_j can be rewritten as \frac{1}{2}\sum_{f=1}^k\left[\left(\sum_{i=1}^n v_{i,f}x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2x_i^2\right], which costs O(k \cdot n). Moreover, if the input is sparse, only the non-zero entries contribute: with m(\textbf{x}) non-zero entries, the interaction term costs only O(k \cdot m(\textbf{x})). As long as m(\textbf{x}) \ll n, FM’s computation is almost linear.
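The paper’s reformulation of the interaction term is easy to check numerically; here is a sketch (function names are mine) that computes it in O(k \cdot n) and a naive double loop for comparison:

```python
import numpy as np

def fm_interaction_fast(x, V):
    """O(k*n) interaction term via the reformulation
    0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2].
    With a sparse x, only the non-zero entries contribute,
    so the cost drops further to O(k * m(x))."""
    per_factor_sum = V.T @ x              # (k,) sums over features
    per_factor_sq = (V ** 2).T @ (x ** 2)  # (k,) sums of squares
    return 0.5 * np.sum(per_factor_sum ** 2 - per_factor_sq)

def fm_interaction_naive(x, V):
    """Reference O(k*n^2) double loop over feature pairs."""
    n = x.shape[0]
    return sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
```

Both functions return the same value on random inputs, confirming the reformulation.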

Nevertheless, the input features for FM still require some feature engineering: the choice of features matters, and feature normalization is necessary.


[1] Rendle, Steffen. “Factorization Machines.” Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM). IEEE, 2010.