NNets Methods for NLP – CH8: From Textual Features to Inputs

These notes highlight the key concepts from chapter 8 of “Neural Network Methods for NLP” by Yoav Goldberg.

One-hot encoding: a sparse feature vector that assigns a unique dimension to each possible feature. Each row of the embedding matrix W corresponds to a particular feature.

Dense encoding: the embedding dimension is much smaller than the number of features, so the matrix W is much smaller. Dense representations generalize better and can be initialized with pre-trained embeddings from a larger text corpus.
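
As a minimal sketch (the vocabulary size, the matrix W, and the word index below are made up, not from the book), a one-hot vector multiplied by W is exactly a row lookup in W, which is why dense encodings are implemented as embedding lookups:

```python
import numpy as np

vocab_size, emb_dim = 10_000, 100
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(vocab_size, emb_dim))  # embedding matrix

word_id = 42  # index of some feature/word in the vocabulary

# One-hot encoding: a sparse vector with a single 1 at the word's dimension.
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying the one-hot vector by W just selects row `word_id` of W ...
dense_via_matmul = one_hot @ W

# ... so in practice the dense encoding is a direct row lookup.
dense_via_lookup = W[word_id]

assert np.allclose(dense_via_matmul, dense_via_lookup)
```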

Window-based features: represent the local structure around the focus word. The three combination strategies below are sketched in code after the list.

  • Concatenate the surrounding word vectors if we care about word position.
  • Sum or average the word vectors if we do not care about word position.
  • Use a weighted sum if we somewhat care about word position.
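
A small sketch of the three combination strategies, assuming an embedding matrix W and a word-to-index mapping are already available (the sentence, window size, and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 50
vocab = {w: i for i, w in enumerate(["the", "dog", "chased", "a", "cat"])}
W = rng.normal(scale=0.1, size=(len(vocab), emb_dim))

sentence = ["the", "dog", "chased", "a", "cat"]
focus = 2                      # "chased"
window = [sentence[focus - 2], sentence[focus - 1],
          sentence[focus + 1], sentence[focus + 2]]
vecs = np.stack([W[vocab[w]] for w in window])     # shape (4, emb_dim)

# 1. Position matters: concatenate -> a (4 * emb_dim) vector.
concat = vecs.reshape(-1)

# 2. Position does not matter: sum or average -> an (emb_dim,) vector.
average = vecs.mean(axis=0)

# 3. Position matters "somewhat": weight words by distance from the focus word.
weights = np.array([0.5, 1.0, 1.0, 0.5])           # closer words weigh more
weighted_sum = (weights[:, None] * vecs).sum(axis=0)

print(concat.shape, average.shape, weighted_sum.shape)  # (200,) (50,) (50,)
```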

CBOW (continuous bag of words) – an average of word vectors.
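
Writing v(f_i) for the embedding vector of feature f_i (notation assumed here), the CBOW of k features is simply their average:

\text{CBOW}(f_1, \dots, f_k) = \frac{1}{k} \sum_{i=1}^{k} v(f_i)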

Padding: add special symbols to the vocabulary, e.g. beginning- and end-of-sentence indicators, for window positions that fall outside the sentence.
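
A minimal sketch of boundary padding when extracting windows; the *PAD* symbol and window size are illustrative choices, not the book's exact ones:

```python
PAD = "*PAD*"  # special symbol added to the vocabulary

def window_around(sentence, focus, size=2):
    """Return `size` words on each side of `focus`, padding past the edges."""
    padded = [PAD] * size + sentence + [PAD] * size
    center = focus + size
    return padded[center - size:center] + padded[center + 1:center + size + 1]

print(window_around(["the", "dog", "barked"], focus=0))
# ['*PAD*', '*PAD*', 'dog', 'barked']
```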

Unknown word: a special token that represents any word not in the vocabulary.

Word signature: a finer-grained strategy for dealing with unknown words. E.g. any rare word ending with ‘ing’ is replaced with a *__ING* token; any number is replaced with a *NUM* token.
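
A sketch combining the unknown-word and word-signature ideas; the token names and regular expressions are illustrative examples of the strategy, not the book's exact rules:

```python
import re

UNK = "*UNK*"

def normalize(word, vocab):
    """Map a word outside the vocabulary to a signature token or *UNK*."""
    if word in vocab:
        return word
    if re.fullmatch(r"\d+([.,]\d+)*", word):
        return "*NUM*"            # any number becomes a *NUM* token
    if word.endswith("ing"):
        return "*__ING*"          # rare words ending in 'ing'
    return UNK                    # otherwise fall back to the unknown token

vocab = {"the", "dog", "chased"}
print([normalize(w, vocab) for w in ["the", "dog", "jogging", "42", "zyzzyva"]])
# ['the', 'dog', '*__ING*', '*NUM*', '*UNK*']
```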

Word Dropout

  • Replace infrequent features (words) with the unknown token, but this loses some information.
  • Randomly replace a word with the unknown token, with a probability based on the word's frequency. One possible formula is \frac{\alpha}{c(w) + \alpha}, where c(w) is the corpus count of word w and \alpha controls the dropout aggressiveness (a sketch follows this list).
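
A sketch of frequency-based word dropout using the formula above; the corpus counts and the value of \alpha are made up for illustration:

```python
import random

UNK = "*UNK*"

def word_dropout(words, counts, alpha=0.25, rng=None):
    """Replace each word with *UNK* with probability alpha / (c(w) + alpha)."""
    rng = rng or random.Random(0)
    out = []
    for w in words:
        p_drop = alpha / (counts.get(w, 0) + alpha)  # rare words drop more often
        out.append(UNK if rng.random() < p_drop else w)
    return out

counts = {"the": 10_000, "dog": 120, "zyzzyva": 1}
print(word_dropout(["the", "dog", "zyzzyva"], counts))
```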

Word dropout as regularization

Apply word dropout to all words, ignoring word frequency: each word is dropped independently via a Bernoulli trial with a fixed probability.
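
A sketch of word dropout as a regularizer: every word is dropped with the same fixed probability via an independent Bernoulli trial (the probability used here is arbitrary):

```python
import random

UNK = "*UNK*"

def word_dropout_reg(words, p=0.1, rng=None):
    """Independently replace each word with *UNK* with probability p."""
    rng = rng or random.Random(0)
    return [UNK if rng.random() < p else w for w in words]

print(word_dropout_reg(["the", "dog", "chased", "a", "cat"], p=0.3))
```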

References:

Chapter 8, “Neural Network Methods for NLP”, 2nd edition, Yoav Goldberg.