[Numpy-discussion] aligned / unaligned structured dtype behavior

Thu Mar 7 13:06:03 EST 2013

On 3/7/13 6:47 PM, Francesc Alted wrote:
> On 3/6/13 7:42 PM, Kurt Smith wrote:
>> And regarding performance, doing simple timings shows a 30%-ish
>> slowdown for unaligned operations:
>>
>> In [36]: %timeit packed_arr['b']**2
>> 100 loops, best of 3: 2.48 ms per loop
>>
>> In [37]: %timeit aligned_arr['b']**2
>> 1000 loops, best of 3: 1.9 ms per loop
>
> Hmm, that clearly depends on the architecture.  On my machine:
>
> In [1]: import numpy as np
>
> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>
> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>
> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>
> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>
> In [6]: baligned = aligned_arr['b']
>
> In [7]: bpacked = packed_arr['b']
>
> In [8]: %timeit baligned**2
> 1000 loops, best of 3: 1.96 ms per loop
>
> In [9]: %timeit bpacked**2
> 100 loops, best of 3: 7.84 ms per loop
>
> That is, the unaligned column is 4x slower (!).  numexpr allows 
> somewhat better results:
>
> In [11]: %timeit numexpr.evaluate('baligned**2')
> 1000 loops, best of 3: 1.13 ms per loop
>
> In [12]: %timeit numexpr.evaluate('bpacked**2')
> 1000 loops, best of 3: 865 us per loop

Just for completeness, here it is what Theano gets:

In [18]: import theano

In [20]: a = theano.tensor.vector()

In [22]: f = theano.function([a], a**2)

In [23]: %timeit f(baligned)
100 loops, best of 3: 7.74 ms per loop

In [24]: %timeit f(bpacked)
100 loops, best of 3: 12.6 ms per loop

So yeah, Theano is also slower for the unaligned case (but less than 2x 
in this case).

>
> Yes, in this case, the unaligned array goes faster (as much as 30%).  
> I think the reason is that numexpr optimizes the unaligned access by 
> doing a copy of the different chunks in internal buffers that fits in 
> L1 cache.  Apparently this is very beneficial in this case (not sure 
> why, though).
>
>>
>> Whereas summing shows just a 10%-ish slowdown:
>>
>> In [38]: %timeit packed_arr['b'].sum()
>> 1000 loops, best of 3: 1.29 ms per loop
>>
>> In [39]: %timeit aligned_arr['b'].sum()
>> 1000 loops, best of 3: 1.14 ms per loop
>
> On my machine:
>
> In [14]: %timeit baligned.sum()
> 1000 loops, best of 3: 1.03 ms per loop
>
> In [15]: %timeit bpacked.sum()
> 100 loops, best of 3: 3.79 ms per loop
>
> Again, the 4x slowdown is here.  Using numexpr:
>
> In [16]: %timeit numexpr.evaluate('sum(baligned)')
> 100 loops, best of 3: 2.16 ms per loop
>
> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
> 100 loops, best of 3: 2.08 ms per loop

And with Theano:

In [26]: f2 = theano.function([a], a.sum())

In [27]: %timeit f2(baligned)
100 loops, best of 3: 2.52 ms per loop

In [28]: %timeit f2(bpacked)
100 loops, best of 3: 7.43 ms per loop

Again, the unaligned case is significantly slower (as much as 3x here!).

-- 
Francesc Alted