Broadcasting

We have just submitted our paper to SIGIR 2017, so now I have time to optimize my experimental code. Speeding up this part of the code is quite important because I have to run many evaluations over different permutations of models, datasets, and hyperparameters. It should be fast.

My evaluation is as follows:

I have a set of test samples and a set of train samples, and I want to compute a score for every pair of test and train samples. Sounds familiar, right? If the score is a Euclidean distance, then I want to calculate all-pair Euclidean distances.
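To make the setup concrete, here is a naive version of that computation (the function and variable names are mine, just for illustration):

    import numpy as np

    def naive_all_pairs(test_data, train_data):
        # scores[i, j] = Euclidean distance between test sample i and train sample j
        scores = np.zeros((test_data.shape[0], train_data.shape[0]))
        for i, x in enumerate(test_data):
            for j, y in enumerate(train_data):
                scores[i, j] = np.sqrt(((x - y) ** 2).sum())
        return scores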

Last month, I was running out of time and did not want to introduce bugs into my experimental code. From my experience, optimized code is harder to read, and if I screw up some part of it, I will end up with code that runs but spits out a wrong number, which is very bad. Hence, I opted not to learn the broadcasting technique in NumPy. Oh boy, have I learned a hard lesson.

My largest dataset contains 300,000 train samples and 20,000 test samples. I need to calculate all-pair scores, rank them, and compute Precision, Recall, and NDCG. That took 15 minutes on this dataset, which is too long for just evaluation.

I dug up my code again, and here is pseudocode similar to what I wrote:

    for idx, test in enumerate(test_data):
        one_sample_scores = distance(test, train_data)  # I did broadcasting here!

This loop is painful. I broadcast one test sample against all the train data, but it was not good enough. Note that the distance function supports broadcasting.
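For concreteness, a minimal broadcasting-friendly Euclidean distance could look like this. The axis argument is my own addition so the same function works wherever broadcasting puts the feature axis; the real function in my code may differ:

    import numpy as np

    def distance(a, b, axis=-1):
        # Euclidean distance over broadcast-compatible inputs,
        # reducing over the given feature axis.
        return np.sqrt(((a - b) ** 2).sum(axis=axis))

Inside the loop above, test has shape (n_features,) and train_data has shape (n_train, n_features), so the subtraction broadcasts to (n_train, n_features) and reduces to n_train scores.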

I replaced the loop with this broadcast:

    test_data = test_data[:, :, None]    # shape: (n_test, n_features, 1)
    train_data = train_data[:, :, None]  # shape: (n_train, n_features, 1)
    scores = distance(test_data, train_data.T)  # broadcasts to (n_test, n_features, n_train)
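To see that the shapes line up, here is a small self-contained check with toy sizes, using the sketch of distance from above. Note that the feature axis now sits in the middle, so it reduces over axis=1:

    import numpy as np

    rng = np.random.default_rng(0)
    test_data = rng.random((5, 8))    # (n_test, n_features)
    train_data = rng.random((12, 8))  # (n_train, n_features)

    # Loop version: one test sample against all train samples at a time.
    loop_scores = np.array([distance(t, train_data) for t in test_data])

    # Fully broadcast version: (n_test, d, 1) against (1, d, n_train)
    # gives an (n_test, d, n_train) difference, reduced over axis=1.
    full_scores = distance(test_data[:, :, None], train_data[:, :, None].T, axis=1)

    assert np.allclose(loop_scores, full_scores)  # same (n_test, n_train) matrix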

With just this simple change, here is the difference in running time. I used timeit to measure it:

Dataset #1 (Train: 11,032, Test: 3,671)

Before: 67 seconds – After: 26 seconds

Dataset #2 (Train: 21,286, Test: 3,498)

Before: 114 seconds – After: 39 seconds

It is 2x-3x faster! Imagine how much time I save with this simple change.

Then I tried my largest dataset and, unfortunately, got a memory error! In hindsight that makes sense: the full broadcast materializes an (n_test, n_features, n_train) intermediate array, and with 20,000 test samples and 300,000 train samples that is 6 billion entries per feature, far more than fits in memory.

I think if I batch my test samples while applying the same broadcasting, I can still speed up my evaluation code without running out of memory; a sketch is below. So optimizing code is necessary, but I have to be careful because I may end up adding new bugs to my experimental code. Testing is required.
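Here is a minimal sketch of that batching idea, again with the illustrative distance function from above. The batch_size is a knob to tune so that each intermediate (batch, n_features, n_train) array fits in memory:

    import numpy as np

    def batched_scores(test_data, train_data, batch_size=512):
        # Process test samples in batches so the broadcast intermediate
        # has shape (batch, n_features, n_train) instead of
        # (n_test, n_features, n_train).
        chunks = []
        for start in range(0, test_data.shape[0], batch_size):
            batch = test_data[start:start + batch_size]
            chunks.append(distance(batch[:, :, None], train_data[:, :, None].T, axis=1))
        return np.vstack(chunks)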

Update (Aug 22, 2017): If you plan to compute a popular distance metric, the pdist function is very fast: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist
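Note that pdist computes pairwise distances within a single set; for the test-versus-train setting above, its companion scipy.spatial.distance.cdist handles two sets directly:

    from scipy.spatial.distance import cdist

    # (n_test, n_train) Euclidean distances, computed in optimized C code.
    scores = cdist(test_data, train_data, metric='euclidean')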