[Numpy-discussion] Byte aligned arrays
Francesc Alted
francesc at continuum.io
Thu Dec 20 09:23:01 EST 2012
On 12/20/12 9:53 AM, Henry Gomersall wrote:
> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>> The only scenario that I see that this would create unaligned arrays
>> is
>> for machines having AVX. But provided that the Intel architecture is
>> making great strides in fetching unaligned data, I'd be surprised
>> that
>> the difference in performance would be even noticeable.
>>
>> Can you tell us which difference in performance are you seeing for an
>> AVX-aligned array and other that is not AVX-aligned? Just curious.
> Further to this point, from an Intel article...
>
> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>
> "Aligning data to vector length is always recommended. When using Intel
> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
> bytes. Similarly, to achieve best results use Intel AVX instructions on
> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
> instructions on unaligned 32-byte vectors means that every second load
> will be across a cache-line split, since the cache line is 64 bytes.
> This doubles the cache line split rate compared to Intel SSE code that
> uses 16-byte vectors. A high cache-line split rate in memory-intensive
> code is extremely likely to cause performance degradation. For that
> reason, it is highly recommended to align the data to 32 bytes for use
> with Intel AVX."
>
> Though it would be nice to put together a little example of this!
Indeed, an example is what I was looking for. Since I have access to
an AVX-capable machine (with 6 physical cores), and MKL 10.3 supports
AVX, I have made some comparisons using the Anaconda Python
distribution (it ships with most packages linked against MKL 10.3).
Here is a first example using a DGEMM operation. First, using a NumPy
that is not turbo-loaded with MKL:
In [34]: a = np.linspace(0,1,1e7)
In [35]: b = a.reshape(1000, 10000)
In [36]: c = a.reshape(10000, 1000)
In [37]: time d = np.dot(b,c)
CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
Wall time: 7.63 s
In [38]: time d = np.dot(c,b)
CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
Wall time: 78.89 s
That is around 2.6 GFlop/s (2*1000*10000*1000 flops in 7.63 s for the
first product). Now, with an MKL 10.3 NumPy and AVX-unaligned data:
In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
Out[7]: '0x7fcdef3b4010' # 16 bytes alignment
In [8]: a = np.ndarray(1e7, "f8", p)
In [9]: a[:] = np.linspace(0,1,1e7)
In [10]: b = a.reshape(1000, 10000)
In [11]: c = a.reshape(10000, 1000)
In [37]: %timeit d = np.dot(b,c)
10 loops, best of 3: 164 ms per loop
In [38]: %timeit d = np.dot(c,b)
1 loops, best of 3: 1.65 s per loop
That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).
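For reference, the GFlop/s figures quoted here can be re-derived from the timings. This is only a small sketch: it assumes the usual 2*m*k*n flop count for a dense (m x k) by (k x n) product, and the helper name dgemm_gflops is made up for illustration.

```python
def dgemm_gflops(m, k, n, seconds):
    """GFlop/s of an (m x k) @ (k x n) product, assuming 2*m*k*n flops."""
    return 2.0 * m * k * n / seconds / 1e9

# np.dot(b, c) with b of shape (1000, 10000) and c of shape (10000, 1000)
plain = dgemm_gflops(1000, 10000, 1000, 7.63)    # NumPy without MKL
mkl = dgemm_gflops(1000, 10000, 1000, 0.164)     # NumPy with MKL 10.3

print("without MKL: %.1f GFlop/s" % plain)
print("with MKL:    %.1f GFlop/s" % mkl)
print("speed-up:    %.1fx" % (mkl / plain))
```

The same function applied to np.dot(c, b), i.e. m=10000, k=1000, n=10000, gives rates of the same order for both the 78.89 s and the 1.65 s timings.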
Now, using MKL 10.3 and AVX-aligned data:
In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p2))
Out[21]: '0x7f8cb9598010'
In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:] # skip the first 16 bytes (now 32-byte aligned)
In [23]: a2[:] = np.linspace(0,1,1e7)
In [24]: b2 = a2.reshape(1000, 10000)
In [25]: c2 = a2.reshape(10000, 1000)
In [35]: %timeit d2 = np.dot(b2,c2)
10 loops, best of 3: 163 ms per loop
In [36]: %timeit d2 = np.dot(c2,b2)
1 loops, best of 3: 1.67 s per loop
So, again, around 120 GFlop/s, and the difference with respect to
AVX-unaligned data is negligible.
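Skipping exactly 16 bytes, as done for a2, only works because the buffer happened to start at an address ending in 0x10. A minimal sketch of a more general approach, using only ctypes, is to over-allocate by the alignment and skip however many bytes are needed. The helper names aligned_offset and aligned_double_buffer are hypothetical, they are not part of ctypes or NumPy.

```python
import ctypes

def aligned_offset(address, alignment):
    """Bytes to skip from `address` to reach the next multiple of `alignment`."""
    return (-address) % alignment

def aligned_double_buffer(n, alignment=32):
    """Return (raw, view, offset): `view` holds n doubles starting on an
    `alignment`-byte boundary inside the over-allocated buffer `raw`."""
    nbytes = n * ctypes.sizeof(ctypes.c_double)
    # Over-allocate so that an aligned start always fits inside the buffer.
    raw = ctypes.create_string_buffer(nbytes + alignment)
    offset = aligned_offset(ctypes.addressof(raw), alignment)
    # from_buffer shares memory with `raw`; nothing is copied.
    view = (ctypes.c_double * n).from_buffer(raw, offset)
    return raw, view, offset

raw, view, offset = aligned_double_buffer(1000)
print(hex(ctypes.addressof(view)), "offset used:", offset)
```

The same offset should also work with np.frombuffer(raw, dtype="f8", count=n, offset=offset) to get a NumPy view on the aligned region, as long as raw is kept alive.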
One may argue that DGEMM is CPU-bound and that memory access plays
little role here, and that is certainly true. So, let's go with a more
memory-bound problem, like computing a transcendental function with
numexpr. First, with NumPy and numexpr with no MKL support:
In [8]: a = np.linspace(0,1,1e8)
In [9]: %time b = np.sin(a)
CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
Wall time: 1.42 s
In [10]: import numexpr as ne
In [12]: %time b = ne.evaluate("sin(a)")
CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
Wall time: 0.37 s
The numexpr version is around 4x faster than the regular 'sin' in libc
(0.37 s vs 1.42 s), and about the same speed as a memcpy():
In [13]: %time c = a.copy()
CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
Wall time: 0.39 s
Now, with an MKL-aware numexpr and non-AVX alignment:
In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
Out[8]: '0x7fce435da010' # 16 bytes alignment
In [9]: a = np.ndarray(1e8, "f8", p)
In [10]: a[:] = np.linspace(0,1,1e8)
In [11]: %time b = ne.evaluate("sin(a)")
CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
Wall time: 0.15 s
That is, more than 2x faster than a memcpy() on this system, meaning
that the problem is truly memory-bound. So now, with an AVX-aligned
buffer:
In [14]: a2 = a[2:] # skip the first 16 bytes
In [15]: %time b = ne.evaluate("sin(a2)")
CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
Wall time: 0.16 s
Again, times are very close. Just to make sure, let's use the timeit magic:
In [16]: %timeit b = ne.evaluate("sin(a)")
10 loops, best of 3: 159 ms per loop # unaligned
In [17]: %timeit b = ne.evaluate("sin(a2)")
10 loops, best of 3: 154 ms per loop # aligned
All in all, it is not clear that AVX alignment has an advantage,
even for memory-bound problems. But of course, if Intel people say
that AVX alignment is important, it is because they have use cases
that back this up. It is just that I'm having a difficult time finding
these cases.
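For what it is worth, the split rate that the Intel article describes can be counted directly. The sketch below is pure address arithmetic under the article's assumptions (64-byte cache lines, back-to-back loads); it says nothing about the actual cost of a split on a given CPU, and split_fraction is a made-up helper name.

```python
def split_fraction(base_offset, vector_bytes, n_loads=1024, line_bytes=64):
    """Fraction of consecutive vector loads that straddle a cache line.

    base_offset is the address of the first load modulo the line size.
    """
    splits = 0
    for i in range(n_loads):
        start = base_offset + i * vector_bytes
        end = start + vector_bytes - 1        # last byte touched by the load
        if start // line_bytes != end // line_bytes:
            splits += 1
    return splits / n_loads

# 32-byte AVX loads on a buffer that is only 16-byte aligned
print(split_fraction(16, 32))
# 32-byte AVX loads on a 32-byte aligned buffer
print(split_fraction(0, 32))
# 16-byte SSE loads on a buffer that is only 8-byte aligned
print(split_fraction(8, 16))
```

With these inputs, half of the AVX loads on the 16-byte aligned buffer cross a line, against a quarter of the misaligned SSE loads and none for the 32-byte aligned buffer, which matches the "doubles the cache line split rate" statement in the article.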
--
Francesc Alted