Inference network: RNNs vs NNets

The standard variational autoencoder [1] uses neural networks to approximate the true posterior distribution by mapping an input to mean and variance of a standard Gaussian distribution. A simple modification is to replace the inference network from neural nets to RNN. That what exactly this paper present [2].

Intuitively, the RNN will work on the dataset that each consecutive features are highly correlated. It means that for the public dataset such as MNIST, RNN should have no problem approximate posterior distribution of any MNIST digit.

I started with a classical VAE. First, I trained VAE on MNIST dataset, with the hidden units of 500 for both encoders and decoders. I set the latent dimension to 2 so that I can quickly visualize on 2D plot.

 

MNIST_2D_VAE

2D embedding using Neural Nets (2-layers) as inference network

Some digits are clustered together but some are mixed together because VAE does not know the label of the digits. Thus, it will still put similar digits nearby, aka digit 7’s are right next to digit 9’s. Many digit 3 and 2 are mixed together. To have a better separation between each digit classes, the label information shall be utilized. In fact, our recent publication to SIGIR’2017 utilizes the label information in order to cluster similar documents together.

But come back to our original research question. Is RNN really going to improve the quality of the embedding vectors?

MNIST_2D_RNN

2D embedding using LSTM as inference network

 

The above 2D plot shows that using LSTM as an inference network has a slightly different embedding space.

MNIST_2D_GRU

2D embedding vectors of randomly chosen MNIST digits using GRU as inference network

LSTM and GRU also generate slightly different embedding vectors. The recurrent model tends to spread out each digit class. For example, digit 6’s (orange) are spread out. All models mixed digit 4 and 9 together. We should know that mixing digits together might not be a bad thing because some writing digit 4 are very similar to 9. This probably indicates that the recurrent model can capture more subtle similarity between digits.

Now, we will see if RNN model might generate better-looking digits than a standard model.

GRU_gen_ditis

GRU

LSTM_gen_digits

LSTM

VAE_gen_digits

neural nets

It is difficult to tell which models are better. In term of training time, neural nets are the fastest, and LSTM is the slowest. It could be that we have not utilize the strength of RNN yet. Since we are working on MNIST dataset, it might be easy for a traditional model (Neural nets) to perform well. What if we train the model on text datasets such as Newsgroup20? Intuitively, RNN should be able to capture the sequential information. We might get a better embedding space, maybe? Next time we will investigate further on text dataset.

References:

[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).

Advertisements

Broadcasting

We have just submitted our paper to SIGIR2017. Now, I have times to optimize my experimental code. To speed up this part of the code is quite important because I have to run many evaluations on different permutation of models, dataset, and hyperparameters. It should be fast.

My evaluation is as follows:

I have a set of test samples and train samples, I want to compute a score for every pair of test and train samples. Sound familiar right? If the score is a Euclidean distance, then I want to calculate all-pair Euclidean distance.

Last month, I was running out of times and did not want to introduce bugs into my experimental code. From my experience, optimize code is harder to read and if I screw some part of the code, I will end up getting a workable code but split out a wrong number, which is very bad. Hence, I opted for not learning the broadcasting technique in Numpy. Oh boy, now I have learned a hard lesson.

My largest dataset contains 300,000 train samples and 20,000 test samples. It took 15 minutes to evaluate the final score. I need to calculate all-pair scores, rank them, and calculate Precision, Recall, and NDCG. It took 15 minutes for this dataset and that was too long for just evaluation.

I dug up my code again and here is the pseudocode similarly to what I wrote:

  • for idx, test in enumerate(test_data):
    • one_sample_scores = distance(test, train_data) # I did broadcasting here!

This loop is painful. I did broadcasting one test sample against all train data but it was not good enough. Note that the distance function is a function that supports broadcasting.

I replaced the loop with this broadcast:

  • test_data = test_data[: , : , None]
  • train_data = train_data[:, :, None]
  • scores = distance(test, train_data.T)

Just by this simple change, here is the difference in terms of running time. I use timeit to measure the time it took:

Dataset#1 (Train:11032, Test:3671)

Before:  67 seconds – After: 26 seconds

Dataset#2 (Train:21286, Test:3498)

Before:  114 seconds – After: 39 seconds

It is 2x-3x times faster! Imagine, how much I save by this simple change.

Now I tried my largest dataset, unfortunately, I got a memory error!

I think if I batch my testing while applying similar broadcast implementation, I am sure I will speed up my evaluation code. So optimizing code is necessary but I have to be careful because I may end up adding new bugs to my experimental code. Testing is required.

Update** Aug 22, 2017: If you plan to compute a popular distance metric, then pdist function is very fast: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist

 

 

 

 

 

Speeding up training the SVM model

I have been puzzled this morning on how slowly the SVM model to learn a binary classification. The training time used to way faster a few days ago.

This led me to suspect that there is a bug in my code, and there is one.

I refactored my code, and one common I usually do is to make sure that all matrices are stored as the same type. If they are sparse, the rest must be sparse. If they are an array, the rest should follow it.

Since I am developing a new model, storing data as 2d array is the most convenient because I can easily look up, slice, and calculate a few statistics.

That is why I train the SVM model (LIBSVM) by passing a huge 2D array with many zeros. By passing a huge array, LIBSVM took very long time to train a binary classifier. So once, I turned the 2d array to a sparse matrix, SVM training is now fast again.

Glad, I found it.

Keras: Merge vs merge

This issue gave me a headache because I did not realize that ‘Merge’ and ‘merge’ are different in Keras.

Merge is for merging layers.

merge is for merging tensors.

Without consulting with the docs and forums, there is no way I could figure this out.

References:

https://github.com/fchollet/keras/issues/2646