We have just submitted our paper to SIGIR 2017, so I finally have time to optimize my experimental code. Speeding up this part of the code matters because I have to run many evaluations over different permutations of models, datasets, and hyperparameters. It should be fast.

My evaluation is as follows:

I have a set of test samples and a set of train samples, and I want to compute a score for every pair of test and train samples. Sounds familiar, right? If the score is a Euclidean distance, then I want to calculate the all-pairs Euclidean distance.

Last month, I was running out of time and did not want to introduce bugs into my experimental code. In my experience, optimized code is harder to read, and if I screw up some part of it, I end up with code that runs but spits out wrong numbers, which is very bad. Hence, I opted not to learn the broadcasting technique in NumPy. Oh boy, have I learned a hard lesson.

My largest dataset contains 300,000 train samples and 20,000 test samples. I need to calculate all-pair scores, rank them, and compute Precision, Recall, and NDCG. That took 15 minutes for this dataset, which is far too long for just evaluation.

I dug up my code again, and here is pseudocode similar to what I wrote:

    for idx, test in enumerate(test_data):
        one_sample_scores = distance(test, train_data)  # I did broadcasting here!

This loop is painful. I broadcast one test sample against all the train data, but that was still not fast enough. Note that distance is a function that supports broadcasting.
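To make the loop concrete, here is a minimal sketch; the broadcasting Euclidean distance and the toy data sizes are hypothetical stand-ins for the real scoring function and datasets:

```python
import numpy as np

# Hypothetical stand-in for distance(): Euclidean distance that supports
# broadcasting and reduces over the last (feature) axis.
def distance(a, b):
    return np.sqrt(((a - b) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
train_data = rng.standard_normal((1000, 16))  # toy sizes, not the real datasets
test_data = rng.standard_normal((50, 16))

# The per-sample loop: each test sample of shape (16,) broadcasts
# against train_data of shape (1000, 16), giving 1000 scores per row.
scores = np.empty((len(test_data), len(train_data)))
for idx, test in enumerate(test_data):
    scores[idx] = distance(test, train_data)
```

Each iteration is vectorized over the train set, but the Python-level loop over test samples remains.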

I replaced the loop with this broadcast:

    test_data = test_data[:, None, :]         # shape: (n_test, 1, n_features)
    scores = distance(test_data, train_data)  # broadcasts to (n_test, n_train)

Just by this simple change, here is the difference in running time. I used timeit to measure it:

Dataset#1 (Train:11032, Test:3671)

Before: 67 seconds – After: 26 seconds

Dataset#2 (Train:21286, Test:3498)

Before: 114 seconds – After: 39 seconds

It is 2x-3x faster! Imagine how much time I save with this simple change.
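A comparison like this can be reproduced on toy data with timeit; the distance function and the data sizes below are illustrative stand-ins, not the real experiment:

```python
import timeit
import numpy as np

# Hypothetical broadcasting Euclidean distance, reducing the last axis.
def distance(a, b):
    return np.sqrt(((a - b) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
train_data = rng.standard_normal((2000, 16))
test_data = rng.standard_normal((200, 16))

def looped():
    # One test sample at a time: Python loop over the test set.
    return np.stack([distance(t, train_data) for t in test_data])

def broadcast():
    # All pairs at once: (200, 1, 16) against (2000, 16) -> (200, 2000).
    return distance(test_data[:, None, :], train_data)

t_loop = timeit.timeit(looped, number=5)
t_bcast = timeit.timeit(broadcast, number=5)
```

The two variants should agree exactly; only the running time differs.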

Then I tried my largest dataset and, unfortunately, got a memory error! The full broadcast materializes an intermediate array of shape (n_test, n_train, n_features), and with 20,000 test and 300,000 train samples that no longer fits in memory.

I think if I batch my test samples while applying the same broadcast, I can still speed up my evaluation code without running out of memory. So optimizing code is necessary, but I have to be careful not to add new bugs to my experimental code. Testing is required.
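The batching idea can be sketched like this; distance, the toy sizes, and the batch_size value are illustrative assumptions, and in practice batch_size would be tuned to the available memory:

```python
import numpy as np

# Hypothetical broadcasting Euclidean distance, reducing the last axis.
def distance(a, b):
    return np.sqrt(((a - b) ** 2).sum(axis=-1))

def all_pair_scores(test_data, train_data, batch_size=128):
    """Broadcast in batches of test samples so that the temporary
    (batch, n_train, n_features) array stays small enough to fit in memory."""
    chunks = []
    for start in range(0, len(test_data), batch_size):
        batch = test_data[start:start + batch_size]             # (b, n_features)
        chunks.append(distance(batch[:, None, :], train_data))  # (b, n_train)
    return np.concatenate(chunks, axis=0)                       # (n_test, n_train)

rng = np.random.default_rng(0)
train_data = rng.standard_normal((1000, 8))
test_data = rng.standard_normal((300, 8))
scores = all_pair_scores(test_data, train_data, batch_size=128)
```

Each batch still enjoys the vectorized broadcast; only the outer loop over batches remains in Python.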

**Update Aug 22, 2017**: If you plan to compute a popular distance metric, the pdist function is very fast: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist
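One note on that update: pdist computes distances within a single collection, while its companion scipy.spatial.distance.cdist computes all pairs between two collections, which matches the test-versus-train setup here. A quick sketch on toy data (SciPy assumed installed):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
train_data = rng.standard_normal((500, 16))
test_data = rng.standard_normal((100, 16))

# cdist: all pairs between two collections -> shape (n_test, n_train)
scores = cdist(test_data, train_data, metric='euclidean')

# pdist: all pairs within one collection, returned in condensed form;
# squareform expands it into the full (n_train, n_train) matrix.
within_train = squareform(pdist(train_data, metric='euclidean'))
```

Both are implemented in C, so they avoid the large broadcast intermediate entirely.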