[Numpy-discussion] Byte aligned arrays

Francesc Alted francesc at continuum.io
Thu Dec 20 09:23:01 EST 2012


On 12/20/12 9:53 AM, Henry Gomersall wrote:
> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>> The only scenario I see where this would create unaligned arrays is
>> for machines having AVX.  But provided that the Intel architecture is
>> making great strides in fetching unaligned data, I'd be surprised if
>> the difference in performance were even noticeable.
>>
>> Can you tell us what difference in performance you are seeing between
>> an AVX-aligned array and one that is not AVX-aligned?  Just curious.
> Further to this point, from an Intel article...
>
> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>
> "Aligning data to vector length is always recommended. When using Intel
> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
> bytes. Similarly, to achieve best results use Intel AVX instructions on
> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
> instructions on unaligned 32-byte vectors means that every second load
> will be across a cache-line split, since the cache line is 64 bytes.
> This doubles the cache line split rate compared to Intel SSE code that
> uses 16-byte vectors. A high cache-line split rate in memory-intensive
> code is extremely likely to cause performance degradation. For that
> reason, it is highly recommended to align the data to 32 bytes for use
> with Intel AVX."
>
> Though it would be nice to put together a little example of this!

Indeed, an example is what I was looking for.  Given that I have access 
to an AVX-capable machine (with 6 physical cores), and that MKL 10.3 
supports AVX, I have made some comparisons using the Anaconda Python 
distribution (it ships with most packages linked against MKL 10.3).

Here is a first example using a DGEMM operation.  First, with a NumPy 
that is not turbo-loaded with MKL:

In [34]: a = np.linspace(0,1,1e7)

In [35]: b = a.reshape(1000, 10000)

In [36]: c = a.reshape(10000, 1000)

In [37]: time d = np.dot(b,c)
CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
Wall time: 7.63 s

In [38]: time d = np.dot(c,b)
CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
Wall time: 78.89 s
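
(A quick back-of-the-envelope flop count of my own, not part of the 
session above, assuming the usual 2*m*n*k operations for an (m, k) x 
(k, n) matrix product:)

flops_bc = 2 * 1000 * 10000 * 1000    # np.dot(b, c): ~2e10 flops
flops_cb = 2 * 10000 * 1000 * 10000   # np.dot(c, b): ~2e11 flops
print(flops_bc / 7.63 / 1e9)          # ~2.6 GFlop/s
print(flops_cb / 78.89 / 1e9)         # ~2.5 GFlop/s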

This is getting around 2.6 GFlop/s.  Now, with an MKL 10.3 NumPy and 
AVX-unaligned data:

In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
Out[7]: '0x7fcdef3b4010'  # 16-byte aligned, but not 32-byte

In [8]: a = np.ndarray(1e7, "f8", p)

In [9]: a[:] = np.linspace(0,1,1e7)

In [10]: b = a.reshape(1000, 10000)

In [11]: c = a.reshape(10000, 1000)

In [37]: %timeit d = np.dot(b,c)
10 loops, best of 3: 164 ms per loop

In [38]: %timeit d = np.dot(c,b)
1 loops, best of 3: 1.65 s per loop

That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).
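
(Side note: one can verify the alignment of such a ctypes-backed array 
directly from Python via the standard ndarray.ctypes.data attribute; 
this little check is mine, not part of the session above:)

addr = a.ctypes.data      # address of the first element
print(hex(addr))          # ends in ...010 here
print(addr % 16 == 0)     # True: 16-byte aligned, fine for SSE
print(addr % 32 == 0)     # False: not 32-byte aligned, i.e. AVX-unaligned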

Now, using MKL 10.3 and AVX-aligned data:

In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p2))
Out[21]: '0x7f8cb9598010'  # again 16-byte aligned, but not 32-byte

In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes (now 32-byte aligned)

In [23]: a2[:] = np.linspace(0,1,1e7)

In [24]: b2 = a2.reshape(1000, 10000)

In [25]: c2 = a2.reshape(10000, 1000)

In [35]: %timeit d2 = np.dot(b2,c2)
10 loops, best of 3: 163 ms per loop

In [36]: %timeit d2 = np.dot(c2,b2)
1 loops, best of 3: 1.67 s per loop

So, again, around 120 GFlop/s, and the difference with respect to 
unaligned AVX data is negligible.
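
(In case somebody wants to reproduce this without counting offsets by 
hand, the buffer-plus-offset trick above can be wrapped in a small 
helper.  This is just a sketch of mine; the name 'aligned_empty' is made 
up and the function is not part of NumPy:)

import ctypes
import numpy as np

def aligned_empty(n, dtype="f8", alignment=32):
    """Return a 1-d array of n items (zero-filled) whose data pointer
    is aligned to `alignment` bytes."""
    itemsize = np.dtype(dtype).itemsize
    # Over-allocate so we can skip up to alignment-1 leading bytes.
    buf = ctypes.create_string_buffer(n * itemsize + alignment)
    offset = (-ctypes.addressof(buf)) % alignment
    # frombuffer keeps a reference to buf, so the memory stays alive.
    return np.frombuffer(buf, dtype=dtype, count=n, offset=offset)

a2 = aligned_empty(int(1e7))
assert a2.ctypes.data % 32 == 0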

One may argue that DGEMM is CPU-bound and that memory access plays 
little role here, and that is certainly true.  So let's go with a more 
memory-bound problem, like computing a transcendental function with 
numexpr.  First with NumPy and numexpr without MKL support:

In [8]: a = np.linspace(0,1,1e8)

In [9]: %time b = np.sin(a)
CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
Wall time: 1.42 s

In [10]: import numexpr as ne

In [12]: %time b = ne.evaluate("sin(a)")
CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
Wall time: 0.37 s

This is around 4x faster than the regular 'sin' in libc, and about the 
same speed as a memcpy():

In [13]: %time c = a.copy()
CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
Wall time: 0.39 s
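
(For reference, my own rough arithmetic on that copy: 1e8 float64 values 
means about 0.8 GB read plus 0.8 GB written, so 0.39 s corresponds to 
roughly 4 GB/s of memory traffic on this machine:)

traffic = 2 * int(1e8) * 8       # bytes read + bytes written by a.copy()
print(traffic / 0.39 / 1e9)      # ~4.1 GB/s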

Now, with an MKL-aware numexpr and non-AVX-aligned data:

In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
Out[8]: '0x7fce435da010'  # 16-byte aligned, but not 32-byte

In [9]: a = np.ndarray(1e8, "f8", p)

In [10]: a[:] = np.linspace(0,1,1e8)

In [11]: %time b = ne.evaluate("sin(a)")
CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
Wall time: 0.15 s

That is, more than 2x faster than a memcpy() on this system, meaning 
that the problem is truly memory-bound.  So now, with an AVX-aligned 
buffer:

In [14]: a2 = a[2:]  # skip the first 16 bytes

In [15]: %time b = ne.evaluate("sin(a2)")
CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
Wall time: 0.16 s

Again, times are very close.  Just to make sure, let's use the %timeit magic:

In [16]: %timeit b = ne.evaluate("sin(a)")
10 loops, best of 3: 159 ms per loop   # unaligned

In [17]: %timeit b = ne.evaluate("sin(a2)")
10 loops, best of 3: 154 ms per loop   # aligned
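
(Applying the same traffic estimate to these timings, again my own 
arithmetic: sin(a) reads and writes about 1.6 GB per evaluation, so 
~155 ms is roughly 10 GB/s, aligned or not, which is consistent with it 
being more than 2x faster than the memcpy() above:)

traffic = 2 * int(1e8) * 8       # bytes read + written by evaluate("sin(a)")
print(traffic / 0.155 / 1e9)     # ~10 GB/s for both the aligned and unaligned runs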

All in all, it is not clear that AVX alignment gives an advantage here, 
even for memory-bound problems.  Of course, if the Intel people say that 
AVX alignment is important, it is presumably because they have use cases 
that back this up.  It is just that I'm having a hard time finding those 
cases.

-- 
Francesc Alted



