A bag of words (BOW) is probably one of the most common document representations for textual datasets. It yields a fixed-length feature vector, which many off-the-shelf machine learning algorithms can use directly. It is an intuitive representation, yet accurate enough for many practical problems. Its extensions, such as TF-IDF and Okapi BM25, are competitive in many information retrieval tasks.
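
To make the fixed-length property concrete, here is a minimal sketch of a BOW count vector in plain Python. The helper names `build_vocabulary` and `bag_of_words`, the whitespace tokenizer, and the two sample documents are assumptions made for illustration; a real pipeline would normally use a library vectorizer with proper tokenization.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: index for index, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return one count per vocabulary word; unknown words are ignored."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical sample documents of different lengths.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)    # column order: cat, dog, mat, on, sat, the
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Both documents map to vectors of the same length (the vocabulary size), regardless of how many words they contain, which is what lets standard classifiers consume them.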

However, a word that appears early in a sentence probably influences the words that come later. The BOW representation discards this sequential information entirely. Another issue is that it does not capture semantic distance, since it is only a vector of word counts.
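
Both limitations are easy to see on toy input. The sketch below uses made-up sentences and a whitespace tokenizer (both assumptions for illustration): two sentences with opposite meanings get the same bag of words, and two phrases with similar meanings share no words at all.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens; word order is not kept."""
    return Counter(text.lower().split())


# 1. Word order is lost: opposite meanings, identical representation.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
order_lost = (a == b)

# 2. No semantic distance: similar meanings, no overlapping counts.
c = bag_of_words("a great film")
d = bag_of_words("an excellent movie")
overlap = set(c) & set(d)
dot_product = sum(c[word] * d[word] for word in overlap)

print(order_lost)   # True
print(overlap)      # set()
print(dot_product)  # 0
```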

Word2Vec is a neural network that learns a vector representation for each word in the corpus. By training the model to predict the next word in a sentence given the earlier words, the model finds a vector representation for each word. The Euclidean distance between these vectors tracks semantic distance: similar words end up close together.
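
As a rough illustration of "similar words will be nearby", the sketch below looks up the nearest neighbour of a word by Euclidean distance. The three-dimensional vectors in `toy_vectors` are hand-picked for this example and are not learned by any model; the `nearest_word` helper is likewise hypothetical.

```python
import math

# Hand-picked toy vectors (an assumption for illustration, NOT trained embeddings).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def nearest_word(word, vectors):
    """Return the other word whose vector is closest in Euclidean distance."""
    others = [w for w in vectors if w != word]
    return min(others, key=lambda w: math.dist(vectors[word], vectors[w]))


print(nearest_word("cat", toy_vectors))  # dog
print(math.dist(toy_vectors["cat"], toy_vectors["dog"])
      < math.dist(toy_vectors["cat"], toy_vectors["car"]))  # True
```

With real embeddings the same lookup would run over vectors produced by a trained model instead of the toy dictionary.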

When working with documents, we might also want to turn each document into a fixed-length vector. But how can we find a vector representation of a document? The paper by Quoc Le and Tomas Mikolov [1] explains how to train a neural network to find these representations.

The idea is to think of a document vector as a document summary. During training, the model is asked to predict the next word given the earlier words in the paragraph and the document summary. For instance, if we want to predict the next word for “We prepare food for our ___”, the next word could be “baby”, “neighbours”, “roommate”, etc. However, if we are also told that this paragraph is about a pet, then we can make a better guess that the next word is related to animals.
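
The prediction step can be sketched as a softmax over candidate words, computed from the document vector combined with the context word vectors. This is only a forward pass under stated assumptions: the vectors are combined by averaging (the paper also describes concatenation), all numbers are hand-set rather than learned, and the names `predict_next`, `context`, `output_weights`, `pet_doc` and `family_doc` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(context_vectors, doc_vector, output_weights):
    """Score each candidate word from the mean of doc and context vectors."""
    inputs = [doc_vector] + context_vectors
    hidden = [sum(column) / len(inputs) for column in zip(*inputs)]
    scores = [sum(w * h for w, h in zip(row, hidden))
              for row in output_weights.values()]
    return dict(zip(output_weights, softmax(scores)))


# Toy 2-d space (assumed): axis 0 ~ "people", axis 1 ~ "animals".
context = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # "food", "for", "our"
output_weights = {
    "baby": [3.0, 0.0],
    "roommate": [3.0, 0.0],
    "dog": [0.0, 3.0],
    "cat": [0.0, 3.0],
}
pet_doc = [0.0, 1.0]     # summary of a paragraph about a pet
family_doc = [1.0, 0.0]  # summary of a paragraph about a family

with_pet = predict_next(context, pet_doc, output_weights)
with_family = predict_next(context, family_doc, output_weights)
print(with_pet["dog"] > with_pet["baby"])        # True
print(with_family["baby"] > with_family["dog"])  # True
```

The context words are identical in both calls; only the document vector changes, and that alone shifts probability toward animal words. In training, the document vector would be adjusted by gradient descent along with the word vectors rather than set by hand.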

This model essentially extends the word2vec model to a longer context. Based on the experimental results reported in the paper, it seems to work well. The model also shows a connection to the memory network paper: given extra relevant information (a summary or a supporting fact), the model makes a better prediction. The difficulty is finding such a relevant and effective fact. In this work, the authors train the model the same way as word2vec.


[1] Le, Quoc V., and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” ICML. Vol. 14. 2014.