The order embedding paper mentioned prior work on learning word embeddings in a probabilistic way. The idea intrigued me, so I went back to read this ICLR paper. Not surprisingly, this pioneering paper was written by a student of Andrew McCallum. I expected a few surprises and an aha moment from this work.
This picture explains everything about Gaussian word embedding:
In word2vec, the model attempts to learn a fixed vector per word so that words appearing in the same sentence have similar vectors, where similarity is measured by the dot product. Probabilistic embedding is a generalization of word2vec: instead of learning only an embedding vector, the model learns two vectors per word, namely a mean and a variance vector.
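To make the parameterization concrete, here is a minimal sketch of what the model stores per word. The vocabulary, dimensionality, and initialization are my own toy choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
vocab = ["dog", "bird", "mammal"]  # hypothetical toy vocabulary

# Each word gets a mean vector plus a diagonal variance vector,
# instead of word2vec's single point vector.
mean = {w: rng.normal(scale=0.1, size=dim) for w in vocab}
var = {w: np.ones(dim) for w in vocab}  # diagonal covariance, initialized to identity

# A word2vec-style similarity is just a dot product of the means;
# the variance is extra information a point embedding cannot carry.
sim = mean["dog"] @ mean["bird"]
```
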
The role of the variance is important. A common word is likely to appear alongside many more words than a specific word does. We would expect a wider variety of sentences to be formed with the common word; hence, its variance should be large.
With this intuition, the covariance matrix of word w could be estimated from the distances to the other word vectors appearing in the same sentence as w:
- is a word vector found in the same sentence as word w
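The intuition above can be sketched with an empirical estimate: collect the vectors of words that co-occur with w and measure their spread. The context matrix here is random toy data, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical context vectors: 200 words that co-occurred with word w.
# A common word sees more varied contexts, so the empirical spread of
# these vectors (and hence its estimated variance) is larger.
contexts = rng.normal(size=(200, 50))

w_mean = contexts.mean(axis=0)
# Diagonal covariance estimate from squared distances to the mean.
w_var = ((contexts - w_mean) ** 2).mean(axis=0)
```

A word appearing in many diverse sentences gets large entries in `w_var`; a specialized word whose contexts cluster tightly gets small ones.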
One problem with this estimation is that the model cannot capture many entailment relationships, because some broader words rarely appear in the text corpus. For example, the sentences “look at a dog” and “look at a bird” are far more common than the sentence “look at a mammal”. The hierarchical relationship between mammal and dog could not be learned by this model.
Energy-based learning approach (EBM)
This learning framework is a generalization of a probabilistic model. A probabilistic model needs to “stick” with a normalized distribution, but an energy-based model does not have to. In sum, in the EBM framework, we first define a compatibility function between a pair of words and use gradient-based inference to keep the “energy” (the score output by the compatibility function) as small as possible. One key advantage of EBM is its ability to incorporate latent variables, which is something that can’t be done directly using feedforward networks. For the sake of keeping this post short, I won’t go deeper; you will find many resources on the EBM approach.
Compatible Function
This paper uses the inner product between two Gaussian distributions as the similarity measure, which is defined as:
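The integral of a product of two Gaussian densities has a closed form: it is itself a Gaussian density evaluated at the difference of the means, with the two covariances summed. A minimal numpy sketch, assuming diagonal covariances (the function name is mine):

```python
import numpy as np

def log_inner_product(mu_i, var_i, mu_j, var_j):
    """Log of the inner product <N_i, N_j> of two diagonal Gaussians.

    Uses the identity  <N_i, N_j> = N(mu_i - mu_j; 0, Sigma_i + Sigma_j),
    so similar means and small variances give a high score.
    """
    var = var_i + var_j
    diff = mu_i - mu_j
    d = mu_i.shape[0]
    return -0.5 * (np.sum(np.log(var))
                   + np.sum(diff ** 2 / var)
                   + d * np.log(2 * np.pi))
```

Working in log space is what gives the numerical stability mentioned below: the raw inner product underflows quickly in high dimensions.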
To train the compatibility function, the authors rely on log-likelihood estimation. They found that rank loss or ratios of densities are less interpretable, and the log-likelihood possibly has better numerical stability.
Another way to train the model is to use KL divergence as the distance measure. I am not sure whether using KL divergence (an asymmetric divergence) is appropriate, because we can force word w to be closer to its neighbor word but not the other way around. I guess when it comes to training deep learning models, we don’t have to be too strict about mathematical conditions such as the distance metric.
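The asymmetry is easy to see numerically. Below is the standard closed-form KL between diagonal Gaussians, with toy parameters of my own choosing: a specific word with small variance nested inside a broad word with large variance:

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL(N0 || N1) for diagonal Gaussians; penalizes N0 mass where N1 is thin."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1
                        - 1.0)

mu_dog, var_dog = np.zeros(2), np.full(2, 0.5)  # specific word: small variance
mu_mam, var_mam = np.zeros(2), np.full(2, 2.0)  # broad word: large variance

# Asymmetry: KL(dog || mammal) is small because the dog distribution fits
# inside the mammal distribution, while KL(mammal || dog) is large.
kl_forward = kl_diag_gauss(mu_dog, var_dog, mu_mam, var_mam)
kl_backward = kl_diag_gauss(mu_mam, var_mam, mu_dog, var_dog)
```

That directionality is arguably a feature rather than a bug here: an asymmetric score is a natural fit for asymmetric relations like entailment.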
Learning
The mean needs to be kept small, while the covariance matrix must remain positive definite. One constraint is to keep each element of the diagonal bounded within a hypercube. This can be done by clipping each element into a pre-defined range.
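A sketch of such a projection step, applied after each gradient update (the bound values and function name are my own illustrative choices):

```python
import numpy as np

C, m, M = 1.0, 1e-3, 10.0  # hypothetical bounds on mean norm and variances

def constrain(mu, var):
    """Project parameters back into the feasible region after a gradient step."""
    # Keep the mean norm bounded so energies cannot blow up.
    norm = np.linalg.norm(mu)
    if norm > C:
        mu = mu * (C / norm)
    # Clip each diagonal variance into [m, M]: this keeps the diagonal
    # covariance matrix positive definite and bounded.
    var = np.clip(var, m, M)
    return mu, var
```

With a diagonal covariance, positive definiteness reduces to every diagonal entry being positive, which is why simple element-wise clipping suffices.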
Evaluation
A broader word has a higher variance than a specific word.
Entailment
As mentioned earlier, this model can somewhat learn entailment directly from the source data. But it can’t learn all word relationships.
In sum:
It is obvious that the proposed model probably would not work well in industry, but this paper provides a probabilistic perspective on word embedding. Indeed, this is an interesting paper, and I greatly enjoyed reading it.