[Numpy-discussion] Array vectorization in numpy

Pauli Virtanen pav at iki.fi
Wed Jul 20 07:31:41 EDT 2011


Wed, 20 Jul 2011 09:04:09 +0000, Pauli Virtanen wrote:
> Wed, 20 Jul 2011 08:49:21 +0200, Carlos Becker wrote:
>> Those are very interesting examples. I think that pre-allocation is
>> very important, and something similar happens in Matlab if no
>> pre-allocation is done: it takes 3-4x longer than with pre-allocation.
>> The main difference is that Matlab is able to take into account a
>> pre-allocated array/matrix, probably avoiding the creation of a
>> temporary and writing the results directly in the pre-allocated array.
> 
> You have not demonstrated that the difference you have comes from
> pre-allocation.
> 
> If it came from pre-allocation, how come I get the same speed as an
> equivalent C implementation, which *does* pre-allocate, using exactly
> the same benchmark code as you posted?

Ok, I looked at it on a different machine, and in the end I'll have to
agree with you :)

Some interesting data points: on my Eee laptop (with the test codes
below), I get

    Numpy: 0.0771
    Numpy (preallocated): 0.0383

On a different machine, on the other hand:

    Numpy: 0.0288
    Numpy (preallocated): 0.0283

For larger array sizes the situation starts to change (4000x4000):

    Numpy (allocation & zeroing): 0.161
    Numpy: 0.1953
    Numpy (preallocated): 0.1153

Also interestingly,

    Zeroing (memset, per element): 1.04313e-08    # 4000x4000 array
    Zeroing (memset, per element): 1.05333e-08    # 3000x3000 array
    Zeroing (memset, per element): 1.04427e-08    # 2048x2048 array
    Zeroing (memset, per element): 2.24223e-09    # 2048x2047 array
    Zeroing (memset, per element): 2.1e-09        # 2000x2000 array
    Zeroing (memset, per element): 1.75e-09       # 200x200 array

    Zeroing (memset, preallocated, per element): 2.06667e-09 # 3000x3000
    Zeroing (memset, preallocated, per element): 2.0504e-09  # 2048x2048
    Zeroing (memset, preallocated, per element): 1.94e-09    # 200x200

There is a sharp order-of-magnitude change in the speed of malloc+memset
of an array, which is not present in memset alone. (This is then also
reflected in the Numpy performance -- the floating point operations
probably cost little compared to memory access.) It seems that either the
kernel or the C library changes the way it handles allocation at that point.
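The threshold can be seen from Python as well. Below is a sketch (the
crossover size is machine-dependent, and attributing it to the allocator
switching strategy, e.g. glibc handing large requests to mmap, is an
assumption): it compares the per-element cost of fresh allocation plus
zeroing (np.zeros) against zeroing a preallocated buffer (ndarray.fill).

```python
import timeit
import numpy as np

# Sketch: per-element cost of allocation+zeroing vs zeroing in place,
# for growing array sizes. Above the allocator's threshold the two
# should separate sharply, matching the memset numbers above.
for n in (200, 1000, 2000, 4000):
    buf = np.empty((n, n))
    t_alloc = min(timeit.repeat(lambda: np.zeros((n, n)),
                                number=10, repeat=3)) / 10
    t_fill = min(timeit.repeat(lambda: buf.fill(0.0),
                               number=10, repeat=3)) / 10
    print("%4d  alloc+zero: %.3g  fill: %.3g  (s per element)"
          % (n, t_alloc / n**2, t_fill / n**2))
```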

So yes, it seems that you were right after all: for large arrays
preallocation can be an issue, but the exact size threshold depends
on the machine etc. in question, which is why I didn't manage to
see this at first.

    ***

In this particular case, we are somewhat limited by Python in what
optimizations we can do. It is not possible to have the expression
`k = m - 0.5` translated into `np.subtract(m, 0.5, k)` instead
of `k = np.subtract(m, 0.5)`, because this translation is done by
Python itself.
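Concretely, writing into a preallocated array has to be done by hand
through the ufunc output argument (a minimal sketch):

```python
import numpy as np

m = np.ones((1000, 1000))
k = np.empty_like(m)        # preallocated result buffer

# Python translates this into k_new = np.subtract(m, 0.5):
# a fresh output array is allocated on every evaluation.
k_new = m - 0.5

# The ufunc out argument writes directly into the existing buffer,
# with no allocation.
np.subtract(m, 0.5, out=k)

assert np.array_equal(k, k_new)
```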

If the translation is taken away from Python, e.g., by switching
to lazy evaluation or via Numexpr, then things can be improved.
There have been some ideas around on implementing matrix algebra
lazily in this way, with the ability of reusing temporary buffers.

The issue of reusing temporaries in expressions such as `a = 0.3*x + y + z`,
however, should be possible to address within Numpy itself.
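Until then, the temporaries in such an expression can be eliminated by
hand with ufunc output arguments (a sketch; `a_naive` is only there for
comparison):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1000000)
y = np.full_like(x, 2.0)
z = np.full_like(x, 3.0)

# Naive form: allocates two temporaries plus the result array.
a_naive = 0.3*x + y + z

# Manual reuse: one allocation, which then serves as the buffer
# for every intermediate step.
a = np.multiply(x, 0.3)     # the eventual result buffer
np.add(a, y, out=a)         # reuse it for the intermediate sum
np.add(a, z, out=a)         # and for the final result

assert np.allclose(a, a_naive)
```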

	Pauli

----------------------------------------------------
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double *a, *b;
    int N = 2000*2000, M = 100;
    int j;
    int k;
    clock_t start, end;

    a = (double*)malloc(sizeof(double)*N);
    memset(a, '\0', sizeof(double)*N); /* initialize: reading uninitialized
                                          memory is undefined behavior */

    b = (double*)malloc(sizeof(double)*N);
    start = clock();
    for (k = 0; k < M; ++k) {
        memset(b, '\0', sizeof(double)*N);
    }
    end = clock();
    printf("Zeroing (memset, preallocated, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        memset(b, '\0', sizeof(double)*N);
        free(b);
    }
    end = clock();
    printf("Zeroing (memset, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);

    b = (double*)malloc(sizeof(double)*N);
    start = clock();
    for (k = 0; k < M; ++k) {
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
    }
    end = clock();
    printf("Operation in C (preallocated): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
        free(b);
    }
    end = clock();
    printf("Operation in C: %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);

    free(a);
    return 0;
}
----------------------------------------------------
import time
import numpy as np

print(np.__version__, np.__file__)

m = np.ones([2000, 2000], float)
N = 100

print((m.size * m.dtype.itemsize) / 1e6)  # array size in MB

t1 = time.perf_counter()
for x in range(N):
    k = np.zeros_like(m)
t2 = time.perf_counter()
print("Numpy (allocation & zeroing):", (t2 - t1) / N)

t1 = time.perf_counter()
for x in range(N):
    k = m - 0.5
t2 = time.perf_counter()
print("Numpy:", (t2 - t1) / N)

t1 = time.perf_counter()
for x in range(N):
    np.subtract(m, 0.5, k)
t2 = time.perf_counter()
print("Numpy (preallocated):", (t2 - t1) / N)



