# Factorization Machine

I stumbled upon this paper, which focuses on predicting a response variable from a sparse dataset. In a standard regression problem, we want to find a weight vector that maps a feature vector to a response value.

$\hat y = w_0 + \bf{w}^T \bf{x}$

This linear equation assumes that the samples $x^{(1)}, x^{(2)}, \cdots, x^{(n)}$ are independent. This assumption may not hold for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, an SVM cannot find a reliable hyperplane under such data sparsity, so we need a model that remains reliable in this setting. Factorization Machine (FM) is also a general predictor, and [1] shows that many prediction problems are equivalent to an FM.

FM models all pairwise feature interactions in the dataset, which greatly increases the number of terms to estimate: with pairwise interactions alone there are $O(n^2)$ of them. The FM model equation is:

$\hat y(\bf{x}) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j$

The first two terms are the usual linear regression formulation. The third term captures the pairwise interactions: the dot product of the feature embedding vectors $\bf{v}_i$ and $\bf{v}_j$ is the weight of the interaction between feature $i$ and feature $j$. This formulation implies that the features are not treated as independent.
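
To make the model equation concrete, here is a minimal NumPy sketch that evaluates it term by term. The function name `fm_predict_naive` and the argument layout are my own, not from the paper:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Naive O(k * n^2) evaluation of the FM model equation.

    x  : (n,) feature vector
    w0 : scalar global bias
    w  : (n,) linear weights
    V  : (n, k) matrix whose rows are the embedding vectors v_i
    """
    n = x.shape[0]
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            # interaction weight of the pair (i, j) is the dot product <v_i, v_j>
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y
```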

The author shows that a 2-way FM is equivalent to an SVM with a polynomial kernel $K(\bf{x},\bf{z}) = \langle\phi(\bf{x}),\phi(\bf{z})\rangle$. The only difference is that the interaction weight matrix $W = [w_{i,j}]$ is dense under the SVM, whereas FM constrains it to be low-rank. This leads to fewer parameters to estimate: writing $W = VV^T$ with $W \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{n \times k}$, the parameter count drops from $O(n^2)$ to $O(nk)$. If $k \ll n$, then $O(nk)$ is almost linear in $n$.
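
A rough back-of-the-envelope illustration of this saving, with made-up sizes ($n = 10{,}000$, $k = 10$):

```python
# Made-up sizes: n features, rank-k factorization.
n, k = 10_000, 10

dense_interaction_params = n * n   # a full pairwise weight matrix W
fm_interaction_params = n * k      # the factorized form W = V V^T
print(dense_interaction_params // fm_interaction_params)  # 1000x fewer parameters
```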

Some people have pointed out that the computation of FM is $O(n^2)$, but that is not the case. If the input matrix is sparse, the number of non-zero products in the interaction term $\sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j$ is actually much smaller than the naive $O(k \cdot n^2)$ count suggests. If $m(n)$ is the number of non-zero entries, the interaction term can be computed in $O(k \cdot m(n))$. As long as $m(n) \ll n$, FM's computation is almost linear.
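
The trick behind this, from the paper, is rewriting the pairwise sum so that it needs only one pass over the non-zero features per factor dimension:

$\sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_j x_i x_j = \frac{1}{2}\sum_{f=1}^k \left[\left(\sum_{i=1}^n v_{i,f}x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2\right]$

A minimal NumPy sketch of this linear-time evaluation (function and variable names are my own) might look like:

```python
import numpy as np

def fm_predict_linear(x, w0, w, V):
    """O(k * m) evaluation of the same FM model, where m is the number of
    non-zero entries in x, using the reformulated interaction term."""
    nz = np.nonzero(x)[0]                # only non-zero features contribute
    xv = x[nz, None] * V[nz]             # (m, k): rows are x_i * v_i
    pairwise = 0.5 * np.sum(xv.sum(axis=0) ** 2 - np.sum(xv ** 2, axis=0))
    return w0 + w[nz] @ x[nz] + pairwise
```

On the same inputs this agrees with the naive double loop above, up to floating-point error.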

Nevertheless, the input features for FM still require some feature engineering: the choice of features matters, and feature normalization is necessary.
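
As a concrete example of this kind of feature engineering, the paper's running example builds the input by one-hot encoding users and items into one sparse row per observed rating. A minimal sketch, with entirely hypothetical data, could look like:

```python
import numpy as np

# Hypothetical rating events: (user, item, rating). The data here is made up.
events = [("alice", "titanic", 5.0), ("bob", "titanic", 3.0), ("alice", "up", 4.0)]

users = sorted({u for u, _, _ in events})
items = sorted({i for _, i, _ in events})

# One row per event: a one-hot user block followed by a one-hot item block.
X = np.zeros((len(events), len(users) + len(items)))
y = np.zeros(len(events))
for row, (u, i, r) in enumerate(events):
    X[row, users.index(u)] = 1.0
    X[row, len(users) + items.index(i)] = 1.0
    y[row] = r
```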

References:

[1] Rendle, Steffen. "Factorization Machines." 2010 IEEE International Conference on Data Mining (ICDM), IEEE, 2010.