[Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

Julian Taylor jtaylor.debian at googlemail.com
Thu Apr 17 13:28:10 EDT 2014


On 17.04.2014 18:06, Francesc Alted wrote:

> 
> In [4]: x_unaligned = np.zeros(shape,
> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']

On arrays of this size you won't see alignment issues; you are dominated
by memory bandwidth. If you see them at all, it will only be when the
data fits into the cache.
It's also about being unaligned to SIMD vectors, not unaligned to basic
types. But it doesn't matter anymore on modern x86 CPUs. I guess for array
data cache line splits should also not be a big concern.
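
To make the two kinds of alignment concrete, here is a minimal sketch that
inspects the array from the quoted session. The value of `shape` is an
assumption (the original session does not show it), and `simd_misalignment`
is a hypothetical helper, not a numpy function. `flags.aligned` only reports
alignment to the basic type; the helper checks the start address against an
assumed 16 byte SIMD vector width.

```python
import numpy as np

# Assumed shape; the quoted session does not show the value it used.
shape = (1000,)

# Packed record: 1 byte + 8 bytes + 7 bytes = 16 bytes per item,
# so the float64 field 'x' starts at byte offset 1 of every record.
packed = np.zeros(
    shape,
    dtype=[('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))],
)
x_unaligned = packed['x']
x_aligned = np.zeros(shape, dtype=np.float64)


def simd_misalignment(arr, width=16):
    """Hypothetical helper: start address of arr modulo a vector width."""
    return arr.ctypes.data % width


print(packed.dtype.itemsize)           # bytes per record
print(packed.dtype.fields['x'][1])     # byte offset of 'x' in a record
print(x_unaligned.flags.aligned)       # alignment to the basic type
print(x_aligned.flags.aligned)
print(simd_misalignment(x_unaligned))  # alignment to a 16 byte vector
print(simd_misalignment(x_aligned))
```

An array can be aligned to its basic type and still start in the middle of a
SIMD vector, which is the case the aligned allocator discussion is about.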

Aligned allocators are not the only allocators which might be useful in
numpy. Modern CPUs also support pages larger than 4K (huge pages up to
1GB in size), which reduce TLB misses. Memory of this type
typically needs to be allocated with special mmap flags, though newer
kernel versions can now also provide it transparently for anonymous
pages (normal non-file mmaps).
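
As a rough illustration of the transparent variant, the sketch below backs a
numpy array with an anonymous mmap and asks the kernel to use huge pages for
it. This is Linux-specific and assumes Python 3.8 or newer for
`mmap.madvise`; `alloc_thp_array` is a hypothetical helper name. The advice
is only a hint, so the kernel may still hand out 4K pages, and the code
ignores a failure of the call.

```python
import mmap

import numpy as np


def alloc_thp_array(n, dtype=np.float64):
    """Hypothetical helper: array on an anonymous mmap with a huge page hint."""
    nbytes = n * np.dtype(dtype).itemsize
    # Private anonymous mapping, i.e. a normal non-file mmap.
    buf = mmap.mmap(-1, nbytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is not None and hasattr(buf, "madvise"):
        try:
            # Hint only: the kernel decides whether huge pages are used.
            buf.madvise(advice)
        except OSError:
            pass  # transparent huge pages unavailable on this kernel
    # The array keeps a reference to buf, so the mapping stays alive.
    return np.frombuffer(buf, dtype=dtype, count=n)


a = alloc_thp_array(1 << 20)
a[:] = 1.0
print(a.sum())
```

Whether huge pages were actually used can be checked outside of Python, for
example through the AnonHugePages counter in /proc/meminfo.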

> 
> In [8]: import numexpr as ne
> 
> In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
> 10 loops, best of 3: 133 ms per loop
> 
> In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
> 10 loops, best of 3: 134 ms per loop
> 
> i.e. there is not a significant difference between aligned and unaligned
> access to data.
> 
> I wonder if the same technique could be applied to NumPy.


You can already do so with relatively simple means:
http://nbviewer.ipython.org/gist/anonymous/10942132

If you change the blocking function to take a function as input and use
in-place operations, numpy can even beat numexpr (though I used the
numexpr Ubuntu package, which might not be compiled optimally).
This type of transformation can probably be applied on the AST quite easily.
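
The idea can be sketched roughly as follows. This is not the code from the
notebook linked above; `blocked_apply`, `poly` and the block size are
assumptions chosen for illustration. The blocking function takes the
operation as an argument and the operation writes into its output block in
place, so no full-size temporaries are created and each block can stay in
cache while it is processed.

```python
import numpy as np


def blocked_apply(func, x, block_size=32768):
    """Apply func(src_block, dst_block) over 1-D x in cache-sized blocks."""
    out = np.empty_like(x)
    for start in range(0, x.shape[0], block_size):
        stop = start + block_size  # slicing clips at the array end
        func(x[start:stop], out[start:stop])
    return out


def poly(src, dst):
    """Compute 3*src**2 + 2*src + 1 into dst using only in-place steps."""
    np.multiply(src, 3.0, out=dst)  # dst = 3*x
    dst += 2.0                      # dst = 3*x + 2
    dst *= src                      # dst = 3*x**2 + 2*x
    dst += 1.0                      # dst = 3*x**2 + 2*x + 1


x = np.linspace(0.0, 1.0, 100001)
res = blocked_apply(poly, x)
print(np.allclose(res, 3 * x**2 + 2 * x + 1))
```

The best block size depends on the cache sizes of the machine and on how
many arrays the expression touches, so it is worth timing a few values.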

> 
> Francesc
> 
> 
> On 17/04/14 16:26, Aron Ahmadia wrote:
>> Hmnn, I wasn't being clear :)
>>
>> The default malloc on BlueGene/Q only returns 8 byte alignment, but
>> the SIMD units need 32-byte alignment for loads, stores, and
>> operations or performance suffers.  On the /P the required alignment
>> was 16-bytes, but malloc only gave you 8, and trying to perform
>> vectorized loads/stores generated alignment exceptions on unaligned
>> memory.
>>
>> See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and
>> https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slides
>> 14 for overview, 15 for the effective performance difference between
>> the unaligned/aligned code) for some notes on this.
>>
>> A
>>
>>
>>
>>
>> On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>     On 17 Apr 2014 15:09, "Aron Ahmadia" <aron at ahmadia.net> wrote:
>>     >
>>     > > On the one hand it would be nice to actually know whether
>>     > > posix_memalign is important, before making api decisions on this
>>     > > basis.
>>     >
>>     > FWIW: On the lightweight IBM cores that the extremely popular
>>     > BlueGene machines were based on, accessing unaligned memory raised
>>     > system faults.  The default behavior of these machines was to
>>     > terminate the program if more than 1000 such errors occurred on a
>>     > given process, and an environment variable allowed you to
>>     > terminate the program if *any* unaligned memory access occurred.
>>     > This is because unaligned memory accesses were 15x (or more)
>>     > slower than aligned memory access.
>>     >
>>     > The newer /Q chips seem to be a little more forgiving of this,
>>     > but I think one can in general expect allocated memory alignment
>>     > to be an important performance technique for future high
>>     > performance computing architectures.
>>
>>     Right, this much is true on lots of architectures, and so malloc
>>     is careful to always return values with sufficient alignment (e.g.
>>     8 bytes) to make sure that any standard operation can succeed.
>>
>>     The question here is whether it will be important to have *even
>>     more* alignment than malloc gives us by default. A 16 or 32 byte
>>     wide SIMD instruction might prefer that data have 16 or 32 byte
>>     alignment, even if normal memory access for the types being
>>     operated on only requires 4 or 8 byte alignment.
>>
>>     -n
>>
>>
>>     _______________________________________________
>>     NumPy-Discussion mailing list
>>     NumPy-Discussion at scipy.org
>>     http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>>
>>
> 
> 
> -- 
> Francesc Alted
> 
> 
> 
> 



