[Numpy-discussion] [OT] Starving CPUs article featured in IEEE's ComputingNow portal

Sat Mar 20 15:22:57 EDT 2010

On 20 March 2010 14:56, Dag Sverre Seljebotn
<dagss at student.matnat.uio.no> wrote:
> Pauli Virtanen wrote:
>> Anne Archibald wrote:
>>> I'm not knocking numpy; it does (almost) the best it can. (I'm not
>>> sure of the optimality of the order in which ufuncs are executed; I
>>> think some optimizations there are possible.)
>>
>> Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC
>> dimensions are always traversed in order from left to right. Large
>> speedups are possible in some cases, but in a quick try I didn't manage to
>> come up with an algorithm that would always improve the speed (there was a
>> thread about this last year or so, and there's a ticket). Things varied
>> between computers, so this probably depends a lot on the actual cache
>> arrangement.
>>
>> But perhaps numexpr has such heuristics, and we could steal them?
>
> At least in MultiIter (and I always assumed ufuncs too, but perhaps not)
> there's functionality to remove the largest dimension so that it can be
> put innermost in a loop. In many situations, removing the dimension with
> the smallest stride from the iterator would probably work much better.
>
> It's all about balancing iterator overhead and memory overhead. Something
> simple like "select the dimension with length > 200 which has smallest
> stride, or the dimension with largest length if none are above 200" would
> perhaps work well?

There's more to it than that: there's no point (I think) going with
the smallest stride if that stride is more than a cache line (usually
64 bytes). There are further optimizations - often
stride[i]*shape[i]==shape[j] for some (i,j), and in that case these
two dimensions can be treated as one with stride stride[i] and length
shape[i]*shape[j]. Applying this would "unroll" all C- or
Fortran-contiguous arrays to a single non-nested loop.

Of course, reordering ufuncs is potentially hazardous in terms of
breaking user code: the effect of
A[1:]*=A[:-1]
depends on whether you run through A in index order or reverse index
order (which might be the order it's stored in memory, potentially
faster from a cache point of view).

Anne
>
> Dag Sverre
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>