[Numpy-discussion] aligned / unaligned structured dtype behavior

Frédéric Bastien nouiz at nouiz.org
Fri Mar 8 10:18:10 EST 2013


I agree that documenting this better would be useful to many people.

So if someone wants to summarize this and put it in the docs, I think
many people will appreciate it.

Fred
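
For anyone writing that doc section up later, here is a minimal sketch of
the behavior discussed below (field layout taken from Francesc's example:
an 'i1' followed by an 'i8'). The offsets and itemsizes are what current
NumPy produces; the `flags.aligned` result assumes a typical platform
where an 8-byte int requires 8-byte alignment:

```python
import numpy as np

# Same field layout as in the thread: a 1-byte int then an 8-byte int.
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)  # the default
aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)

# Packed: 'b' starts at byte offset 1, total itemsize 9 bytes.
# Aligned: 'b' is padded out to offset 8, total itemsize 16 bytes.
print(packed_dt.itemsize, packed_dt.fields['b'][1])    # 9 1
print(aligned_dt.itemsize, aligned_dt.fields['b'][1])  # 16 8

packed_arr = np.ones(10, dtype=packed_dt)
bpacked = packed_arr['b']          # a view onto unaligned data
print(bpacked.flags.aligned)       # False on typical platforms

# One way to recover aligned data is an explicit copy:
bfixed = np.require(bpacked, requirements=['ALIGNED'])
print(bfixed.flags.aligned)        # True
```

So the space cost of align=True here is 16 vs. 9 bytes per record, which
is the other side of the speed numbers quoted below.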

On Thu, Mar 7, 2013 at 10:28 PM, Kurt Smith <kwmsmith at gmail.com> wrote:
> On Thu, Mar 7, 2013 at 12:26 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
>> Hi,
>>
>> It is normal that unaligned accesses are slower. The hardware has been
>> optimized for aligned access, so this is a user trade-off: space vs. speed.
>
> The quantitative difference is still important, so this thread is
> useful for future reference, I think.  If reading data into a
> packed array is 3x faster than reading into an aligned array, but the
> core computation is 4x slower with a packed array...you get the idea.
>
> I would have benefited years ago from knowing that (1) numpy structured
> dtypes are packed by default, and (2) computations with unaligned data
> can be several factors slower than with aligned data.  That's strong
> motivation to always make sure I'm using 'align=True' except when memory
> usage is an issue, or for file IO with packed binary data, etc.
>
>> We can't get around that. We can only minimize the cost of unaligned
>> access in some cases, but not all, and those optimizations depend on
>> the CPU. Newer CPUs have, however, lowered the cost of unaligned access.
>>
>> I'm surprised that Theano worked with the unaligned input. I added
>> some checks to make this raise an error, as we do not support that!
>> Francesc, can you check whether Theano gives the right result? It is
>> possible that someone (maybe me) just copies the input to an aligned
>> ndarray when we receive an unaligned one. That could explain why it
>> worked, but my memory tells me that we raise an error.
>>
>> As you saw in the numbers, this is a bad example for Theano, as the
>> compiled function is too fast. There is more Theano overhead than
>> computation time in that example. We have recently reduced the
>> overhead, but we can do more to lower it.
>>
>> Fred
>>
>> On Thu, Mar 7, 2013 at 1:06 PM, Francesc Alted <francesc at continuum.io> wrote:
>>> On 3/7/13 6:47 PM, Francesc Alted wrote:
>>>> On 3/6/13 7:42 PM, Kurt Smith wrote:
>>>>> And regarding performance, doing simple timings shows a 30%-ish
>>>>> slowdown for unaligned operations:
>>>>>
>>>>> In [36]: %timeit packed_arr['b']**2
>>>>> 100 loops, best of 3: 2.48 ms per loop
>>>>>
>>>>> In [37]: %timeit aligned_arr['b']**2
>>>>> 1000 loops, best of 3: 1.9 ms per loop
>>>>
>>>> Hmm, that clearly depends on the architecture.  On my machine:
>>>>
>>>> In [1]: import numpy as np
>>>>
>>>> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>>>>
>>>> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>>>>
>>>> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>>>>
>>>> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>>>>
>>>> In [6]: baligned = aligned_arr['b']
>>>>
>>>> In [7]: bpacked = packed_arr['b']
>>>>
>>>> In [8]: %timeit baligned**2
>>>> 1000 loops, best of 3: 1.96 ms per loop
>>>>
>>>> In [9]: %timeit bpacked**2
>>>> 100 loops, best of 3: 7.84 ms per loop
>>>>
>>>> That is, the unaligned column is 4x slower (!).  numexpr gives
>>>> somewhat better results:
>>>>
>>>> In [11]: %timeit numexpr.evaluate('baligned**2')
>>>> 1000 loops, best of 3: 1.13 ms per loop
>>>>
>>>> In [12]: %timeit numexpr.evaluate('bpacked**2')
>>>> 1000 loops, best of 3: 865 us per loop
>>>
>>> Just for completeness, here it is what Theano gets:
>>>
>>> In [18]: import theano
>>>
>>> In [20]: a = theano.tensor.vector()
>>>
>>> In [22]: f = theano.function([a], a**2)
>>>
>>> In [23]: %timeit f(baligned)
>>> 100 loops, best of 3: 7.74 ms per loop
>>>
>>> In [24]: %timeit f(bpacked)
>>> 100 loops, best of 3: 12.6 ms per loop
>>>
>>> So yeah, Theano is also slower for the unaligned case (but less than 2x
>>> in this case).
>>>
>>>>
>>>> Yes, in this case the unaligned array goes faster (by as much as 30%).
>>>> I think the reason is that numexpr optimizes the unaligned access by
>>>> copying chunks into internal buffers that fit in L1 cache.  Apparently
>>>> this is very beneficial in this case (not sure why, though).
>>>>
>>>>>
>>>>> Whereas summing shows just a 10%-ish slowdown:
>>>>>
>>>>> In [38]: %timeit packed_arr['b'].sum()
>>>>> 1000 loops, best of 3: 1.29 ms per loop
>>>>>
>>>>> In [39]: %timeit aligned_arr['b'].sum()
>>>>> 1000 loops, best of 3: 1.14 ms per loop
>>>>
>>>> On my machine:
>>>>
>>>> In [14]: %timeit baligned.sum()
>>>> 1000 loops, best of 3: 1.03 ms per loop
>>>>
>>>> In [15]: %timeit bpacked.sum()
>>>> 100 loops, best of 3: 3.79 ms per loop
>>>>
>>>> Again, the 4x slowdown is here.  Using numexpr:
>>>>
>>>> In [16]: %timeit numexpr.evaluate('sum(baligned)')
>>>> 100 loops, best of 3: 2.16 ms per loop
>>>>
>>>> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
>>>> 100 loops, best of 3: 2.08 ms per loop
>>>
>>> And with Theano:
>>>
>>> In [26]: f2 = theano.function([a], a.sum())
>>>
>>> In [27]: %timeit f2(baligned)
>>> 100 loops, best of 3: 2.52 ms per loop
>>>
>>> In [28]: %timeit f2(bpacked)
>>> 100 loops, best of 3: 7.43 ms per loop
>>>
>>> Again, the unaligned case is significantly slower (as much as 3x here!).
>>>
>>> --
>>> Francesc Alted
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion