[Numpy-discussion] Byte aligned arrays

Fri Dec 21 06:45:41 EST 2012

On 12/21/12 11:58 AM, Henry Gomersall wrote:
> On Fri, 2012-12-21 at 11:34 +0100, Francesc Alted wrote:
>>> Also this convolution code:
>>> https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c
>>>
>>> Shows a small but repeatable speed-up (a few %) when using some
>>> aligned loads (as many as I can work out to use!).
>> Okay, so a 15% is significant, yes.  I'm still wondering why I did not
>> get any speedup at all using MKL, but probably the reason is that it
>> manages the unaligned corners of the datasets first, and then uses an
>> aligned access for the rest of the data (but just guessing here).
> With SSE in that convolution code example above (in which all alignments
> need be considered for each output element), I note a significant
> speedup by creating 4 copies of the float input array using memcpy,
> each shifted by 1 float (so the 5th element is aligned again). Despite
> all the extra copies, it's still quicker than using an unaligned load.
> However, when one tries the same trick with 8 copies for AVX it's
> actually slower than the SSE case.
>
> The fastest implementation I have so far (AVX or otherwise) uses
> 16-byte-aligned arrays (made with 4 copies as with SSE), with alternating
> aligned and unaligned loads (so every load is at worst 16-byte aligned).
>
> Fascinating stuff!

Yes, to say the least.  And it supports the fact that, when fine-tuning
memory access performance, there is no replacement for experimentation
(often in rather weird ways :)
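
For anyone wanting to experiment from the NumPy side, here is a minimal
sketch of the two ideas discussed above: getting an array whose data
pointer is aligned to a given byte boundary, and building the "shifted
copies" of a float array so that every element starts an aligned run in
one of the copies.  The helper names aligned_empty and shifted_copies are
made up for illustration (they are not NumPy functions), and this only
prepares the buffers; the SIMD loads themselves would still live in C.

```python
import numpy as np

def aligned_empty(n, dtype=np.float32, align=16):
    """Return an uninitialised 1-D array of n items whose data pointer
    is a multiple of `align` bytes (hypothetical helper)."""
    dtype = np.dtype(dtype)
    nbytes = n * dtype.itemsize
    # Over-allocate by `align` bytes, then slice from the first aligned byte.
    buf = np.empty(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + nbytes].view(dtype)

def shifted_copies(x, lanes=4, align=16):
    """Return `lanes` aligned copies of x, where copy k holds x[k:]
    (hypothetical helper).  With float32 and lanes=4, x[k + 4*m] sits
    on a 16-byte boundary in copy k."""
    copies = []
    for k in range(lanes):
        c = aligned_empty(x.size - k, x.dtype, align)
        c[:] = x[k:]
        copies.append(c)
    return copies

x = np.arange(32, dtype=np.float32)
copies = shifted_copies(x)
for k, c in enumerate(copies):
    # Each copy starts on a 16-byte boundary and mirrors x from index k.
    print(k, c.ctypes.data % 16, c[0])
```

The slices keep the over-allocated buffer alive through their base
reference, so the aligned views stay valid for as long as they are in
use.  Whether the extra copies pay off is exactly the kind of thing that
has to be measured on the machine at hand.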

-- 
Francesc Alted