Recurrent Recommender Networks (WSDM’17)

The motivation of this work is to tackle the collaborative filtering problem in a realistic setting. Classical collaborative filtering models interpolate a rating based on both past and future ratings. But in a real-world situation, there are no future rating scores. Therefore, being able to extrapolate, i.e., predict future ratings, is more practical.

One important argument for why many CF models performed well on the Netflix dataset is that the training and testing distributions were mixed in time. The models were effectively fed future ratings; hence, it was easy to predict a user's rating.

Therefore, modeling the temporal and causal aspects of the rating data is the main goal of this work. They give the example of the movie ‘Plan 9’, which was initially reviewed as a bad film but became very popular later. Another observation is that some movies are more popular during Christmas and the summer. It is also reasonable to assume that a user's preferences change over time: as they grow older, their taste in films will change.

With all these motivations and observations, they propose to use an RNN to model user and movie dynamics over time, hoping that the RNN will capture both exogenous and endogenous dynamics. The key ingredient of their model is to incorporate a wall clock (time index) as part of the sample features. Here is what each training sample looks like:

s_t = [x_t, 1_{\text{newbie}}, \tau_t, \tau_{t-1}]

x_t is a vector of the user's ratings at time t: x_{jt} = k means this user rated movie j at time t with a rating of k, and x_{jt} = 0 means this user did not rate movie j. 1_{\text{newbie}} seems to be an indicator of whether a user has no previous ratings – a new user. The last two parameters, the wall-clock times \tau_t and \tau_{t-1}, are important because the RNN uses the time index to handle steps with no ratings.
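A minimal sketch of how one such sample could be assembled, assuming a toy movie count and a dict of the ratings given at time t; the names here (`make_sample`, `NUM_MOVIES`) are illustrative, not from the paper:

```python
NUM_MOVIES = 5  # toy catalog size for illustration

def make_sample(ratings_t, is_newbie, tau_t, tau_prev):
    """Build s_t = [x_t, 1_newbie, tau_t, tau_{t-1}].

    ratings_t: dict mapping movie index j -> rating k given at time t
               (x_{jt} = 0 when movie j was not rated).
    """
    x_t = [ratings_t.get(j, 0) for j in range(NUM_MOVIES)]
    return x_t + [1 if is_newbie else 0, tau_t, tau_prev]

# A user who rated movies 1 and 3 at wall-clock step 7:
s_t = make_sample({1: 4, 3: 5}, is_newbie=False, tau_t=7, tau_prev=6)
# s_t == [0, 4, 0, 5, 0, 0, 7, 6]
```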

Another important component is a function that projects s_t into an embedding space and feeds the embedding vector to an LSTM unit. Adding this linear transformation can be viewed as converting raw data into a more abstract representation. It also implies that the model does not feed user ratings to an LSTM unit directly. The LSTM is used to model user and movie dynamics; we can view the trained LSTM as a function that models these dynamics. The authors train two RNN models: one for users and another for movies.
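A pure-Python sketch of this projection-then-LSTM pipeline, shrunk to a 1-dimensional embedding and scalar LSTM state for readability; the weights `W`, `b`, and the gate parameters in `params` are made-up placeholders, not the paper's learned values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def project(s_t, W, b):
    """Linear projection of the raw sample s_t into an embedding space."""
    return [sum(w * x for w, x in zip(row, s_t)) + b_j
            for row, b_j in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update with scalar input/state; p holds the gate weights."""
    i = sigmoid(p['wi'] * x + p['ui'] * h_prev)    # input gate
    f = sigmoid(p['wf'] * x + p['uf'] * h_prev)    # forget gate
    o = sigmoid(p['wo'] * x + p['uo'] * h_prev)    # output gate
    g = math.tanh(p['wg'] * x + p['ug'] * h_prev)  # candidate cell state
    c = f * c_prev + i * g                         # new cell state
    h = o * math.tanh(c)                           # new hidden (dynamic) state
    return h, c

# Embed a raw sample, then advance the dynamic state by one step.
sample = [0, 4, 0, 5, 0, 0, 7, 6]
W, b = [[0.1] * len(sample)], [0.0]
params = {k: 0.1 for k in ('wi', 'ui', 'wf', 'uf', 'wo', 'uo', 'wg', 'ug')}
emb = project(sample, W, b)
h, c = lstm_step(emb[0], 0.0, 0.0, params)
```

In the real model both the projection and the LSTM are vector-valued and learned jointly, but the data flow is the same: raw sample → embedding → recurrent state update.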

Finally, at each time step, the model predicts a rating as follows:

\hat{r}_{ij|t} = f(u_{it}, m_{jt}, u_i, m_j) = \langle \tilde u_{it}, \tilde m_{jt} \rangle + \langle u_i, m_j \rangle

This equation extends standard matrix factorization with the dynamic states \tilde u_{it}, \tilde m_{jt}. It means that at each time step, this model solves a matrix factorization based on the ratings up to time t.
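A tiny numeric sketch of this prediction rule; the vectors below are made-up stand-ins for the LSTM's dynamic states and the stationary factors:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_rating(u_dyn, m_dyn, u_stat, m_stat):
    """r_hat = <dynamic user state, dynamic movie state>
             + <stationary user factor, stationary movie factor>."""
    return dot(u_dyn, m_dyn) + dot(u_stat, m_stat)

r_hat = predict_rating([0.5, 1.0], [1.0, 0.5], [1.0, 2.0], [0.5, 1.0])
# dynamic part contributes 1.0, stationary part 2.5, so r_hat == 3.5
```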

Training this model requires alternating optimization, since we can't train the user and movie networks simultaneously – otherwise, there would be too many RNNs to unroll across all movies. Thus, the authors fix the movie dynamic function and train the user dynamic function; then they fix the user dynamic function and train the movie dynamic function, alternating between the two. Training the model this way is more scalable.
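The fix-one-side, update-the-other schedule can be illustrated on a deliberately tiny stand-in problem: alternating gradient steps on a single scalar user factor and movie factor fitting one rating. This is not the paper's RNN training, only a sketch of the alternating pattern; `sgd_step` and the learning rate are my own placeholders:

```python
def sgd_step(fixed, free, rating, lr=0.1):
    """One gradient step on `free` for the squared error
    (free * fixed - rating)^2, holding `fixed` constant."""
    err = free * fixed - rating
    return free - lr * 2 * err * fixed

def alternating_train(u, m, rating, rounds=50):
    for _ in range(rounds):
        u = sgd_step(m, u, rating)  # movie side fixed, train user side
        m = sgd_step(u, m, rating)  # user side fixed, train movie side
    return u, m

u, m = alternating_train(1.0, 1.0, 4.0)
# after training, u * m approaches the target rating 4.0
```

In the actual model, each `sgd_step` would be a full optimization pass over one side's RNN and stationary factors while the other side's outputs are treated as constants.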

The experimental results show that this model (RRN) beats TimeSVD++, AutoRec, and PMF. Further, the model can capture exogenous factors, such as the rating-scale change in the Netflix dataset and awards like the Oscars or Golden Globes, as well as endogenous factors such as seasonal changes.

My 2 cents: I like this paper because the motivation is well written, and I can see the benefit of modeling the dynamics of users and movies. I am surprised that there are not many related works that attempt to solve the extrapolation problem.