Embedding-based Query Language Models (ICTIR’16)

This paper proposes a more accurate query language model for the query expansion task, based on word embeddings.

The key contribution is their observation that the cosine similarity between a query word and its top-ranked neighbors is not much larger than its similarity to, say, the 1000th-ranked word. To sharpen this contrast, they apply a sigmoid function that amplifies the similarity scores.

The rest of the paper proposes two query language models.

The first model

The first model assumes that the query terms are conditionally independent given the candidate word w.

The model is:

p(w|\theta_q) \propto p(\theta_q|w)p(w) = p(w)\prod_{i=1}^kp(q_i|w)

It defines: p(q_i|w) = \frac{\delta(q_i,w)}{\sum_{w'\in V} \delta(w',w)}

And p(w) = \sum_{w' \in V} p(w, w') \propto \sum_{w' \in V} \delta(w, w')
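As a toy illustration, the first model can be sketched as follows. The embeddings, vocabulary, and the sigmoid's steepness and shift are all hypothetical choices for the sketch, not the paper's values:

```python
import numpy as np

# Toy word embeddings (hypothetical); each row is a word in the vocabulary V.
vocab = ["news", "sports", "game", "election"]
E = np.array([[0.9, 0.1],
              [0.1, 0.9],
              [0.2, 0.8],
              [0.8, 0.2]])
E = E / np.linalg.norm(E, axis=1, keepdims=True)

def delta(i, j):
    """Transformed similarity: a sigmoid applied to cosine similarity
    to amplify score differences (steepness/shift are assumptions)."""
    cos = float(E[i] @ E[j])
    return 1.0 / (1.0 + np.exp(-10 * (cos - 0.5)))

def model1(query_ids):
    """p(w|theta_q) ∝ p(w) * prod_i p(q_i|w), with
    p(q_i|w) = delta(q_i, w) / sum_{w'} delta(w', w) and
    p(w) ∝ sum_{w'} delta(w, w')."""
    V = len(vocab)
    scores = np.empty(V)
    for w in range(V):
        norm = sum(delta(wp, w) for wp in range(V))
        s = sum(delta(w, wp) for wp in range(V))  # p(w), unnormalized
        for q in query_ids:
            s *= delta(q, w) / norm
        scores[w] = s
    return scores / scores.sum()

p = model1([1])  # query word: "sports"
```

For the query "sports", the related terms "sports" and "game" end up with most of the probability mass, which is the intended expansion behavior.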

The second model

p(w|\theta_q) = \sum_{w'\in V} p(w|w',\theta_q)p(w'|\theta_q)

This model assumes that the similarity between w and w' is independent of the query:

p(w|w',\theta_q) = p(w|w') = \frac{\delta(w,w')}{\sum_{w''\in V} \delta(w'', w')}

and the second term uses MLE to estimate the probability:

p(w'|\theta_q) = \frac{C(w'|Q)}{|Q|}
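The second model can be sketched the same way; here the δ matrix is a hypothetical precomputed (sigmoid-transformed) similarity matrix over a tiny vocabulary:

```python
import numpy as np

# Hypothetical symmetric similarity matrix delta(w, w') over 4 words,
# already sigmoid-transformed as the paper suggests.
delta = np.array([[1.0, 0.1, 0.2, 0.8],
                  [0.1, 1.0, 0.9, 0.2],
                  [0.2, 0.9, 1.0, 0.3],
                  [0.8, 0.2, 0.3, 1.0]])

def model2(query_ids):
    """p(w|theta_q) = sum_{w'} p(w|w') p(w'|theta_q), where
    p(w|w') = delta(w, w') / sum_{w''} delta(w'', w') and
    p(w'|theta_q) is the MLE over the query terms."""
    V = delta.shape[0]
    p_w_given_wp = delta / delta.sum(axis=0, keepdims=True)  # columns sum to 1
    p_wp = np.bincount(query_ids, minlength=V) / len(query_ids)  # MLE
    return p_w_given_wp @ p_wp

p = model2([1, 2])  # toy query containing words 1 and 2
```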

The paper also discusses an embedding-based relevance model: the existing relevance model by the same authors is extended with the proposed embedding-based query language model.


[1] Zamani, Hamed, and W. Bruce Croft. “Embedding-based query language models.” Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval.



Pseudo-Relevance Feedback based on Matrix Factorization (CIKM’16)

This work uses pseudo-relevance feedback documents to recommend top words and their weights for expanding the input query.

It starts by creating a document-term matrix where each row is a document and each column is a term, with entries holding term weights. The first row is the input query’s term weights. The remaining rows come either from relevance feedback documents or from the top-k retrieved documents (pseudo-relevance feedback).

Then, we perform non-negative matrix factorization (NMF) on the matrix. The first row of the reconstructed matrix is a re-estimated query, and its top m terms are used as query expansion terms.


However, this model needs to re-compute the NMF for every query, which can be expensive. Thus, the paper recommends creating a small document-term matrix (e.g. using only the top 10 retrieved documents).
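A rough sketch of the pipeline, with NMF implemented via plain multiplicative updates on a toy matrix (sizes and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term matrix: row 0 is the query's term weights,
# rows 1..k are top-k pseudo-relevant documents (hypothetical values).
X = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.9, 0.8, 0.7, 0.0, 0.1],
              [0.8, 0.9, 0.6, 0.1, 0.0],
              [0.1, 0.0, 0.9, 0.8, 0.7]])

def nmf(X, r, iters=500):
    """Plain multiplicative-update NMF: X ≈ W @ H with W, H >= 0."""
    n, m = X.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

W, H = nmf(X, r=2)
X_hat = W @ H
# Row 0 of the reconstruction is the re-estimated query; its top-m
# terms serve as expansion terms (m = 3 here).
expansion = np.argsort(-X_hat[0])[:3]
```

Note how the re-estimated query can assign weight to term 2, which the original query never contained, because the similar feedback documents use it.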

What makes this model stand out is that it not only considers the words that discriminate the top documents from the collection, it also considers the relevance of those words to the original query.

I wonder whether computing the NMF on such a small matrix really works. The empirical results demonstrate that the idea works, though its performance is not impressive. Still, I think using NMF for query expansion is a neat idea.



Generating images from captions with attention (ICLR’16)


This work extends the original DRAW paper [2] to generate images given captions. We can treat this model as a conditional DRAW: it models the conditional probability P(\text{image}|\text{caption}). The additional textual input controls where to read from and write to the image.


Generating images from text descriptions is a structured prediction task: given a sequence of words, we want to generate an image. Although AlignDRAW borrows the same approach as DRAW, combining progressive refinement with attention, incorporating the text sequence is its main contribution.

The latent variable in the DRAW model is sampled from a spherical Gaussian, z_t \sim \mathcal{N}(\mu_t, \sigma_t), whose mean and variance are functions of the current hidden state of the encoder, e.g. \mu_t = W h_t^{\text{enc}}. AlignDRAW, however, adds dependencies between the latent variables: z_t \sim \mathcal{N}(\mu(h_{t-1}^{\text{gen}}), \sigma(h_{t-1}^{\text{gen}})).

During image generation, DRAW iteratively samples a latent variable z_t from the prior \mathcal{N}(0, I), whereas AlignDRAW draws z_t from P(z_t|z_{<t}). This means there is a dependency between the latent vectors in the AlignDRAW model.


Align Operator

The input caption is fed to a bidirectional RNN, and the output at each time-step must be aligned with the current drawing patch. An attention weight is learned from the caption representation up to k words and the current hidden state of the decoder, h_{t-1}^{\text{gen}}. Finally, the weighted average of all hidden states of the language model gives the caption context s_t. This context, together with the latent vector z_t, is fed to the LSTM decoder.
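A sketch of this align step using additive attention; the exact parameterization here is an assumption for illustration, and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical sizes: k caption words, language-model states of size d_lang,
# decoder state of size d_dec, attention layer of size d_att.
k, d_lang, d_dec, d_att = 5, 8, 6, 4
h_lang = rng.standard_normal((k, d_lang))   # bidirectional-RNN outputs
h_dec = rng.standard_normal(d_dec)          # decoder state h_{t-1}^{gen}

# Alignment scores from a small additive-attention network
# (a common parameterization; the paper's exact form may differ).
W = rng.standard_normal((d_att, d_lang))
U = rng.standard_normal((d_att, d_dec))
v = rng.standard_normal(d_att)

scores = np.array([v @ np.tanh(W @ h_lang[j] + U @ h_dec) for j in range(k)])
alpha = softmax(scores)          # attention weights over caption words
s_t = alpha @ h_lang             # caption context: weighted average of states
```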

Objective Function

This model maximizes the expectation of the variational lower bound, which consists of two terms: the data likelihood and a KL-divergence loss.

Closing Thoughts

AlignDRAW uses a bidirectional LSTM with attention to align each word’s context with the patches in the image. Some images generated from captions are interesting, such as ‘A herd of elephants walking across a dry grass field’. The model generalizes beyond the training data and is able to generate novel images.


[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).

DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation that progressively improves the reconstructed image.

Previous image generation models generate the entire image in one shot by learning a sampling function (GANs) or a distribution over a latent vector (VAE), or they generate one pixel at a time (PixelRNN, PixelCNN). Although the generated images from these models are of good quality, the models are forced to learn a complicated, high-dimensional distribution. For example, to generate a car image, a model must approximate the distribution of all possible cars, which is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks down the complex distribution into a chain of conditional distribution:

P(X, C_0, \ldots, C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)P(C_0)

Therefore, estimating each conditional distribution is much easier. The conditional probabilities are modeled by a standard LSTM.

Latent Variable

Using the VAE framework helps us project the high-dimensional input image into a low-dimensional space. Working in the smaller latent space is much easier than working in the original image space.

Attention Mechanism

The progressive refinement through the LSTM simplifies the complex distribution over time; the attention mechanism then simplifies the spatial data into a smaller patch. The encoder and decoder now only need to deal with a small fraction of the image instead of the whole image. This again reduces the input space by focusing only on the important parts of the image.

Read and Write Operations

This part can be intimidating to read at first glance due to the use of Gaussian filters. There are many nice blog posts that describe the read and write operations with the attention mechanism in detail. The main idea is that the read operation crops the input image, and the write operation draws a patch onto the canvas matrix.
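A minimal NumPy sketch of the read operation with a 1-D Gaussian filterbank per axis; the attention parameters here are fixed by hand for illustration rather than emitted by the network:

```python
import numpy as np

def filterbank(center, stride, sigma, N, size):
    """N x size matrix of 1-D Gaussian filters: row i is a normalized
    Gaussian centred at center + (i - N/2 + 0.5) * stride."""
    mu = center + (np.arange(N) - N / 2 + 0.5) * stride
    a = np.arange(size)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)

def read(image, gx, gy, stride, sigma, N):
    """Extract an N x N attended patch: patch = F_y @ image @ F_x^T."""
    H, W = image.shape
    Fx = filterbank(gx, stride, sigma, N, W)
    Fy = filterbank(gy, stride, sigma, N, H)
    return Fy @ image @ Fx.T

# Toy example: read an 8x8 patch centred on a bright square.
img = np.zeros((28, 28))
img[10:18, 10:18] = 1.0
patch = read(img, gx=14.0, gy=14.0, stride=1.0, sigma=0.5, N=8)
```

The write operation is the transpose of this: the decoder's patch is smeared back onto the canvas via F_y^T (patch) F_x.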


This is a must-read paper. Combining progressive refinement through time with an attention mechanism is a nice way to simplify the complex image distribution. This is one of the early papers that combine an RNN with attention to handle spatial data such as images. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


DocNADE (JMLR 2016)

This work extends Neural Autoregressive Distribution Estimation (NADE) to document modeling.


The key idea of NADE is that each hidden and output vector is modeled as a conditional probability of the previously seen dimensions:

p(\textbf{v}) = \prod_{i=1}^D p(v_i | \textbf{v}_{<i})

\textbf{h}( \textbf{v}_{<i}) = g(c + W_{:,<i}\textbf{v}_{<i})

Then, the probability of the output is:

p(v_i=1|\textbf{v}_{<i}) = \sigma(b_i + V_{i,:}\textbf{h}_i(\textbf{v}_{<i}))


NADE has a set of separate hidden layers, each representing a previously seen context. However, NADE is not directly applicable to variable-length inputs such as a sequence of words.


The DocNADE model tackles the variable-length input issue by computing the hidden vector as follows:

\textbf{h}( \textbf{v}_{<i}) = g(c + \sum_{k<i} W_{:,v_k})

Each word v_k is an index into a fixed-size vocabulary, and each column of the matrix W is a word embedding. Hence, the summation of word vectors represents the preceding word context. This does not preserve word order, since the model simply sums all word vectors.
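A toy sketch of this hidden computation; the vocabulary size, hidden size, and document are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 10, 4                      # hypothetical vocabulary and hidden sizes
W = rng.standard_normal((H, V))   # each column is a word embedding
c = np.zeros(H)                   # hidden bias
g = np.tanh                       # activation

doc = [3, 7, 1, 7]                # word indices of a toy document

# h(v_{<i}) = g(c + sum_{k<i} W[:, v_k]): the hidden state at position i
# is a transformed sum of the embeddings of all preceding words, so the
# running (cumulative) sums give every context cheaply in one pass.
prefix_sums = np.cumsum(W[:, doc], axis=1)
hiddens = [g(c)] + [g(c + prefix_sums[:, i]) for i in range(len(doc) - 1)]
```

The first entry is the empty-context hidden state (bias only); the summation makes the representation order-insensitive, exactly as noted above.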


The output layer requires a softmax over the vocabulary to compute each word probability; a hierarchical softmax is necessary to scale up this computation.

DocNADE Language Model

The previous model may not be suitable for language modeling because it focuses on learning a semantic representation of the whole document. For language modeling, the hidden layer needs to pay more attention to the immediately preceding terms. This can be accomplished by adding an n-gram component:

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{b} + \textbf{h}_i^{DN}(\textbf{v}_{<i}) + \textbf{h}_i^{LM}(\textbf{v}_{<i}))

The additional hidden unit \textbf{h}_i^{LM} models an n-gram language model:

\textbf{h}_i^{LM}(\textbf{v}_{<i}) = \sum_{k=1}^{n-1}U_k W_{:,v_{i-k}}^{LM}

The matrix W^{LM} is a word embedding matrix for the n-gram component.



DocNADE is similar to a recurrent neural network in that both models estimate the conditional probability of the current input given the previous inputs. For language modeling, an RNN is less explicit about how much context to look back on, whereas DocNADE requires us to tell the model explicitly how many words to look back. On the other hand, DocNADE has a similar flavor to Word2Vec, where the document representation is simply an aggregate of all previously seen words; DocNADE, however, adds an additional transformation on top of the hidden units.

Will this type of autoregressive model fall out of fashion due to the success of recurrent networks with attention mechanisms and memory models? The current trend suggests that RNNs are more flexible and extensible than NADE, so I expect to see more development and extension of RNN models in the coming years.


Lauly, Stanislas, et al. “Document neural autoregressive distribution estimation.” arXiv preprint arXiv:1603.05962 (2016).


Towards a Neural Statistician (ICLR2017)

One extension of the VAE is to add a hierarchical structure. In the classical VAE, the prior is a standard Gaussian distribution. We can instead learn this prior from the data so that each dataset has its own prior distribution.

The generative process is:

  • Draw a dataset prior \mathbf{c} \sim N(\mathbf{0}, \mathbf{I})
  • For each data point in the dataset
    • Draw a latent vector \mathbf{z} \sim P(\cdot | \mathbf{c})
    • Draw a sample \mathbf{x} \sim P(\cdot | \mathbf{z})

The likelihood of the dataset is:

p(D) = \int p(c) \big[ \prod_{x \in D} \int p(x|z;\theta)p(z|c;\theta)dz \big]dc

The paper defines the approximate inference networks q(z|x,c;\phi) and q(c|D;\phi) to optimize a variational lower bound. The lower bound on the log-likelihood of a single dataset is:

\mathcal{L}_D = E_{q(c|D;\phi)}\big[  \sum_{x \in D} E_{q(z|c, x; \phi)}[ \log p(x|z;\theta)] - D_{KL}(q(z|c,x;\phi)||p(z|c;\theta)) \big] - D_{KL}(q(c|D;\phi)||p(c))

The interesting contribution of this paper is the statistic network q(c|D; \phi), which approximates the posterior distribution over the context c given the dataset D. Basically, this inference network uses an encoder to map each data point to a vector e_i = E(x_i), then a pooling layer to aggregate the e_i into a single vector (the paper uses an element-wise mean). Finally, the pooled vector is used to generate the parameters of a diagonal Gaussian.
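A minimal sketch of the statistic network with a one-layer encoder and mean pooling; all shapes and the encoder form are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a dataset D of n points in d dims, hidden size h,
# context (latent) size c.
n, d, h, c = 6, 3, 5, 2
D = rng.standard_normal((n, d))

# One-hidden-layer encoder E and linear heads for the Gaussian parameters.
W1, b1 = rng.standard_normal((h, d)), np.zeros(h)
W_mu, W_logvar = rng.standard_normal((c, h)), rng.standard_normal((c, h))

e = np.tanh(D @ W1.T + b1)     # encode each datapoint: e_i = E(x_i)
pooled = e.mean(axis=0)        # element-wise mean pooling over the dataset
mu = W_mu @ pooled             # mean of the diagonal Gaussian
logvar = W_logvar @ pooled     # log-variance: q(c|D) = N(mu, diag(exp(logvar)))
```

Because the mean pooling is permutation-invariant, q(c|D) does not depend on the order of the datapoints, which is the property a set-level posterior needs.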


This model works surprisingly well on many tasks, such as topic modeling, transfer learning, and one-shot learning.


https://arxiv.org/abs/1606.02185 (Poster ICLR 2017)

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions (ICML’17)

One reason that a VAE with an LSTM decoder is less effective than an LSTM language model is that the LSTM decoder ignores the conditioning information from the encoder. This paper uses a dilated CNN as the decoder to improve perplexity on held-out data.

Language Model

The language model can be modeled as:

p(\textbf{x}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1})

An LSTM language model uses this conditional distribution to predict the next word.

By adding an additional contextual random variable [2], the language model can be expressed as:

p(\textbf{x}, \textbf{z}) = p(\textbf{z}) \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1}, \textbf{z})

The second model is more flexible, as it explicitly models the high variation in sequential data. Without careful training, however, the VAE-based language model often degrades to a standard language model, because the decoder chooses to ignore the latent variable produced by the encoder.

Dilated CNN

The authors replace the LSTM decoder with a dilated CNN decoder to control the contextual capacity. When the convolutional receptive field is large, the decoder covers a long context and resembles an LSTM; when it is small, the model behaves more like a bag-of-words model. The receptive field thus controls the contextual capacity: how much past context is used to predict the current word.


Stacking dilated CNN layers is crucial for better performance because it increases the context window exponentially with depth. WaveNet [3] also uses this approach.
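The exponential growth of the context window can be checked with a small receptive-field calculation (kernel size 2 with WaveNet-style doubled dilations, versus the same depth without dilation):

```python
# Receptive field of a stack of dilated causal convolutions with kernel
# size k: each layer with dilation d adds (k - 1) * d positions, so
# doubling the dilations (1, 2, 4, ...) grows the context exponentially
# with depth, while undilated layers grow it only linearly.
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

stacked = receptive_field(2, [1, 2, 4, 8, 16])   # doubled dilations -> 32
flat = receptive_field(2, [1, 1, 1, 1, 1])       # no dilation -> 6
```

Five dilated layers already see 32 past tokens, where five plain layers see only 6; this is why stacking with doubled dilations matters.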


By choosing a more suitable decoder, the VAE can now perform well on language modeling. Since textual sequences do not contain a lot of variation, we may not notice an obvious improvement; more significant gains may appear on more complex sequential data such as speech or audio signals. Also, the experimental results show that the dilated CNN is a better decoder than the LSTM, but the improvements in perplexity and NLL over the standard LSTM language model are still incremental. We hope to see stronger VAE-based language models in the future.


[1] Yang, Zichao, et al. “Improved Variational Autoencoders for Text Modeling using Dilated Convolutions.” arXiv preprint arXiv:1702.08139 (2017).

[2] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).

[3] Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).