Many machine learning algorithms access their input data primarily through pairwise distances. It is therefore important that we have a fast function that computes pairwise (Euclidean) distances of input vectors. Assume we have n d-dimensional data vectors x_ i and m d-dimensional vectors z_ j. And let us define a d×n matrix X, where the ith column is a vector x_ i and similarly a d×m matrix Z. Our distance function takes as input the two matrices X and Z and outputs a n×m matrix D, where the (i,j)th element is the Euclidean distance between x_ i and z_ j.
How not to program in MATLAB
A naive MATLAB implementation of such a distance function would look like this:function D=l2distanceSlow(X,Z) [d,n]=size(X); % dimension of X [~,m]=size(Z); % dimension of Z D=zeros(n,m); % allocate memory for the output matrix for i=1:n % loop over vectors in X for j=1:m % loop over vectors in Z D(i,j)=0.0; for k=1:d % loop over dimensions D(i,j)=D(i,j)+(X(k,i)-Z(k,j))^2; % compute l2-distance between the ith and jth vector end; D(i,j)=sqrt(D(i,j)); % take square root end; end;
Please read through the code carefully and make sure you understand it. It is perfectly correct and will produce the correct result … eventually. To see what is wrong, run the following commands in the MATLAB console: X=rand(100,10000); Z=rand(100,2000); tic;D=l2distanceSlow(X,Z);toc
This code defines some random data in X and Z and computes the corresponding distance matrix D. The tic,toc statements time how long this takes. On my laptop it takes over a minute. This is way too slow!!! If I were to compute the distances between larger training and testing sets or higher dimensional data points (imagine for instance images with millions of pixels), it would take days!! The problem is that the distance function uses a programming style that is well suited for C++ or Java, but not MATLAB!!
What is the problem? MATLAB is an interpreted language with a lot of slow execution overhead. This means you have to pay dearly for every command you execute. In our example we have three nested loops and call the inner most line many many times. This means almost our entire running time is spent in execution overhead. As a general rule, you should avoid tight loops at all cost.
How to program in MATLAB
Although there is an execution overhead per line in MATLAB, matrix operations are extremely optimized and very fast. In order to become a successful MATLAB programmer, you need to free yourself from the “for-loop” thinking and start thinking in terms of matrix operations. MATLAB can be very fast, if almost all the time is spent on a few heavy duty matrix operations. In this exercise you will do this, and transform the function above into a few matrix operations without any loops at all.
The key to efficient programming in MATLAB and machine learning in general is to think about it in terms of mathematics, and not in terms of loops.
For our exercise the trick is to express the Euclidean distance between pairs of points in terms of inner products and find a representation in terms of a sum of matrices. So, before you implement anything, take pen an paper and figure out this representation!
Now, implement the function
l2distance.m, which computes the Euclidean distance matrix D without a single loop. Test the distance function again: X=rand(100,10000); Z=rand(100,2000); tic;D=l2distance(X,Z);toc
How much faster is your code now? Just for a little motivation: mine runs in 0.6 seconds now! And if you would like to improve upon this I can show you a way to get down to 0.4 seconds! Ask me if you are interested. With this implementation you should easily be able to compute the distances between many more high-dimensional feature vectors.