This paper proposed a content-based recommendation model using attention mechanism. The task is to recommend video or image to a user based on user’s interaction with the content such as view counts or clicks. This is an implicit feedback problem.

The key motivation is that in the video or image recommendation, users do not like the entire video or images but only pay attention to the subset of the content. For the video, only some frames that users like. For an image, only some part of the images are interesting to the user. The attention mechasim

The model resembles SVD++ because the overall rating has both latent factor model and neighborhood model:

The first term is a standard latent factor model: finding an embedding vector for user and item. The second term is a neighborhood model: the more similar between item j and some items that user i likes, the higher rating prediction.

The difference is: SVD++ uses a uniform weight but this model learns weight from the data. Furthermore, the neighborhood model does not compute a dot product between two embedding item vectors but *between item vector and auxiliary item vector*. This vector is used to characterize users based on the set of items they interacted with.

The model architecture basically has 3 parts. The first part is to compute an item vector based on a weight combination of all components. E.g. find a vector representation for a video that has 100 frames. This representation is a personalized item representation because the weight (attention) is computed from a nonlinear function that takes a user and all item components.

The second part is to compute a neighborhood item vector, . The attention is computed from a function that takes the user i, item l, auxiliary tiem l, . Then, this neighborhood vector will be added to a user vector i.

The final part performs a pair-wise learning. This paper use BPR (Bayesian Personalized Ranking) to minimize the rank loss between positive and negative samples. The evaluation metric use leave-one-out which is to keep the user’s last interaction with an item as the test set. Then, they measure hit rate and NDCG.

The hit rate of the proposed model is better than SVD++. I think the author should include FISM as a baseline since this model also performs well on implicit feedback task. Furthermore, SVD++ does not perform well on Top-N task since it optimized for RMSE metrics. Hence, SVD++ is not a strong baseline.

Overall, what I like about this work is the hierarchy model that construct a personalized item representation based on an attention. This idea makes a lot of sense because (1) people focus on different part of the content. The item vector should then be personalized; (2) This model has a lot of parameters so it works well on the large video and image dataset.

**References:**

http://www.comp.nus.edu.sg/~xiangnan/papers/sigir17-AttentiveCF.pdf