I highlighted the key concepts from chapter 8 of “Neural Network Methods for NLP” by Yoav Goldberg.
One-hot encoding: a sparse feature vector that assigns a unique dimension to each possible feature. Each row of the matrix W corresponds to a particular feature.
Dense encoding: the dimensionality is smaller than the number of features, so the matrix W is much smaller. Dense representations generalize better and can be initialized with pre-trained embeddings from a larger text corpus.
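A minimal sketch of why the two views are equivalent (the vocabulary size, embedding dimension, and random initialization below are illustrative assumptions, not values from the book): multiplying a one-hot vector by W is the same as selecting one row of W, which is why dense embeddings are implemented as a simple lookup table.

```python
import numpy as np

vocab_size, emb_dim = 10_000, 100          # assumed sizes for illustration
W = np.random.randn(vocab_size, emb_dim)   # embedding matrix, one row per word

word_id = 42                               # index of some word in the vocabulary

# One-hot view: a sparse indicator vector times W ...
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
dense_via_matmul = one_hot @ W

# ... equals directly selecting row `word_id` of W (an embedding lookup).
dense_via_lookup = W[word_id]

assert np.allclose(dense_via_matmul, dense_via_lookup)
```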
Window-based features: represent the local structure around the focus word.
- Concatenate the surrounding word vectors if we care about word position.
- Sum or average the word vectors if we do not care about word position.
- Use a weighted sum if we somewhat care about word position (see the sketch below).
CBOW (continuous bag of words) – an average of word vectors.
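A minimal sketch of the three combination strategies (the window size, distance weights, and toy embeddings are illustrative assumptions):

```python
import numpy as np

emb_dim = 4
# Assumed toy embeddings for a 5-word window around a focus word.
window = [np.random.randn(emb_dim) for _ in range(5)]

# Position matters: concatenate into one long vector (5 * emb_dim dimensions).
concat = np.concatenate(window)

# Position ignored: average the vectors (this is exactly the CBOW encoding).
cbow = np.mean(window, axis=0)

# Position matters somewhat: weight nearby words more than distant ones.
weights = np.array([0.5, 1.0, 2.0, 1.0, 0.5])   # assumed distance-based weights
weighted = np.sum([w * v for w, v in zip(weights, window)], axis=0) / weights.sum()
```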
Padding: add special symbols to the vocabulary, e.g. beginning- and end-of-sentence indicators, so that windows near a sentence boundary are well defined.
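For example (the `*PAD*` symbol and window size here are illustrative assumptions), padding lets us extract a full window even for the first word of a sentence:

```python
PAD = "*PAD*"            # assumed padding symbol added to the vocabulary
k = 2                    # window of k words on each side

sentence = ["the", "dog", "barked"]
padded = [PAD] * k + sentence + [PAD] * k

# The window around the first real word ("the") is now well defined:
i = k                    # index of "the" in the padded sentence
window = padded[i - k : i + k + 1]
print(window)            # ['*PAD*', '*PAD*', 'the', 'dog', 'barked']
```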
Unknown word: a special token that represents any word not in the vocabulary.
Word signature: a finer-grained strategy for dealing with unknown words. E.g. any rare word ending with ‘ing’ is replaced with a *__ING* token; any number is replaced with a *NUM* token.
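A minimal sketch of such a signature function (the particular regex pattern and the `*UNK*` fallback token are illustrative assumptions):

```python
import re

def word_signature(word: str) -> str:
    """Map a rare or unknown word to a coarse signature token."""
    if re.fullmatch(r"\d+(\.\d+)?", word):
        return "*NUM*"          # any number collapses to a single NUM token
    if word.endswith("ing"):
        return "*__ING*"        # rare '-ing' words share one signature
    return "*UNK*"              # assumed fallback for everything else

print(word_signature("3.14"))            # *NUM*
print(word_signature("defenestrating"))  # *__ING*
print(word_signature("zyzzyva"))         # *UNK*
```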
Word Dropout
- Replace some infrequent features (words) with the unknown token during training, but we lose some information.
- Randomly replace a word with the unknown token, with probability based on word frequency. One possible formula is to replace word w with probability α / (#(w) + α), where #(w) is the corpus frequency of w and α is the dropout aggressiveness parameter.
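A minimal sketch of frequency-based word dropout (the corpus counts and the α value are illustrative assumptions):

```python
import random
from collections import Counter

alpha = 0.25                                              # assumed aggressiveness
counts = Counter({"the": 5000, "dog": 40, "zyzzyva": 1})  # assumed corpus counts
UNK = "*UNK*"

def word_dropout(word: str) -> str:
    """Replace `word` with UNK with probability alpha / (#(w) + alpha)."""
    p = alpha / (counts[word] + alpha)
    return UNK if random.random() < p else word

# Frequent words are almost never dropped; rare words often are.
print([word_dropout(w) for w in ["the", "dog", "zyzzyva"]])
```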
Word dropout as regularization
Apply word dropout to all words, ignoring word frequency: for each word, run an independent Bernoulli trial with a fixed dropout probability (see the sketch below).
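A minimal sketch of this uniform, Bernoulli-trial variant (the dropout rate is an illustrative assumption):

```python
import random

p_drop = 0.1   # assumed fixed dropout probability, independent of frequency
UNK = "*UNK*"

def bernoulli_word_dropout(words):
    # Each word is independently replaced with UNK with probability p_drop.
    return [UNK if random.random() < p_drop else w for w in words]

print(bernoulli_word_dropout(["the", "dog", "barked"]))
```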
References:
Yoav Goldberg, “Neural Network Methods for NLP”, 2nd edition, Chapter 8.