[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Sat Mar 22 14:41:31 EDT 2008

On 22.03.2008 at 19:20, Travis E. Oliphant wrote:

>I think the thing to do is to special-case the code so that if the
>strides work for vectorization, then a different bit of code is executed
>and this current code is used as the final special-case.

>Something like this would be relatively straightforward, if a bit
>tedious, to do.

I've experimented with branching the ufuncs into separate code paths
for different constant strides and for aligned/unaligned data, so that
SSE can be used via compiler intrinsics.
I expected a considerable gain, as I was using float32 with stride 1
most of the time.
However, profiling revealed that hardly anything was gained, because of
1) non-alignment of the vectors (this _could_ be handled by shuffled
loading of the values, though)
2) the fact that my application used relatively large vectors that
wouldn't fit into the CPU cache, so memory transfer, not the CPU,
became the bottleneck.
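
A minimal sketch of the kind of branching described above, not the code
that was actually profiled. The name add_f32 and its signature are
hypothetical; strides are counted in elements here for brevity, whereas
numpy's real inner loops use byte strides. The SSE path is taken only
for stride 1, and it picks aligned or unaligned loads depending on the
pointers; everything else falls through to the generic strided loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical float32 add loop: out[i] = a[i] + b[i].
   Strides are in elements (an assumption made for this sketch). */
void add_f32(const float *a, ptrdiff_t sa,
             const float *b, ptrdiff_t sb,
             float *out, ptrdiff_t so, ptrdiff_t n)
{
    ptrdiff_t i = 0;
    assert(n >= 0);
#ifdef __SSE__
    if (sa == 1 && sb == 1 && so == 1) {
        /* Contiguous case: process 4 floats per SSE register. */
        int aligned =
            ((((uintptr_t)a | (uintptr_t)b | (uintptr_t)out) & 15) == 0);
        if (aligned) {
            /* All three pointers are 16-byte aligned. */
            for (; i + 4 <= n; i += 4)
                _mm_store_ps(out + i,
                             _mm_add_ps(_mm_load_ps(a + i),
                                        _mm_load_ps(b + i)));
        } else {
            /* At least one pointer is unaligned: use unaligned access. */
            for (; i + 4 <= n; i += 4)
                _mm_storeu_ps(out + i,
                              _mm_add_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
        }
    }
#endif
    /* Generic strided loop: the final special case. It also handles
       the leftover tail elements of the SSE paths. */
    for (; i < n; i++)
        out[i * so] = a[i * sa] + b[i * sb];
}
```

Each element is loaded once, added once and stored once, so for
vectors larger than the cache the loop has very little arithmetic per
byte moved, whichever branch is taken.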

I found the latter to be a real showstopper for most of my experiments
with SIMD. It's especially a problem for numpy because for smaller
vectors the Python/numpy overhead dominates, and larger ones don't
really benefit because they no longer fit in the cache.
I'm curious whether OpenMP gives better results, as the cores of a
multi-core CPU often share a cache.
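
For reference, an OpenMP version of the contiguous loop could look
roughly like the sketch below. The name add_f32_omp is hypothetical,
and this is untested as far as performance goes: it only shows the
shape of the code. Whether it helps for vectors that exceed the cache
is exactly the open question.

```c
#include <stddef.h>

/* Hypothetical contiguous float32 add: out[i] = a[i] + b[i].
   When built with OpenMP enabled (e.g. -fopenmp with GCC) the
   iterations are divided among threads; otherwise the pragma is
   compiled out and the loop runs serially. */
void add_f32_omp(const float *a, const float *b, float *out, ptrdiff_t n)
{
    ptrdiff_t i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```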

greetings,
Thomas