[Numpy-discussion] Re: Bytes Object and Metadata

Tue Mar 29 13:55:37 EST 2005

On Tuesday 29 March 2005 15:23, Francesc Altet wrote:
> This issue was brought up on this list some months ago (see [1]).
> I, for one, have given up calling NA_updateDataPtr() during table
> reads in PyTables, and this sped up the reading process by 70%,
> which is no joke. In theory, the same speed-up could be achieved
> in every piece of code that reads like:
>
> for i in range(n):
>     a = numarrayobject[i]
>
> that is, whenever a single element of an array is accessed.

Well, the statement above is not exactly true.  The overhead
introduced by NA_updateDataPtr (and other functions related to the
buffer object) matters most when you call the __getitem__ method
from *extensions*, and less (though still significantly!) when you
are in pure Python.

This evening I wanted to evaluate how much faster things would be
if it were not necessary to call NA_updateDataPtr and companions
(i.e. getting rid of the buffer object). I found some interesting
results and ended up writing a rather long report that took this
sunny Spring evening away from me :(

Despite its rather serious format, please don't take it as a
serious demonstration of anything.  I wrote it basically because I
need maximum performance on __getitem__ operations and was curious
about what Numeric/numarray/Numeric3 can offer in that regard.  I'm
publishing it here because it could be of help to somebody.

Cheers,

-- 
>qo<   Francesc Altet     http://www.carabos.com/
V  V   Cárabos Coop. V.   Enjoy Data
 ""

 A note on __getitem__ performance on Numeric/numarray on Python extensions
                (with a small follow-up on Numeric3)
==========================================================================
                  Francesc Altet 2005-03-29

                          Abstract
                          ========

Numeric [1] and numarray [2] are Python packages that provide very
convenient containers for dealing efficiently with large amounts of
data in memory. Because their implementations are quite different,
each package is naturally better suited than the other in some
areas. In fact, we are lucky to have such a duality, because
competition is essential in any (sane) software ecosystem.

The best way of determining which package is better adapted to a
certain task is benchmarking. In this report, I have made use of psyco
[3] and oprofile [4] in order to decide which is the best candidate
for accessing the data in the containers from C extensions.

In the appendix, some attention is also given to Numeric3, a newborn
contender to Numeric and numarray.

Motivation
==========

I need peak performance when accessing data held in
Numeric/numarray objects from my extensions, so I decided to do some
profiling on the following code, which is representative of my needs:

niter = 5
N = 1000*1000
def matrix_loop(object):
    for j in xrange(niter):
        for i in xrange(N):
            p = object[i]

This basically exercises the __getitem__ special method in
Numeric/numarray objects.
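
As a small illustration of that statement, here is a sketch in
present-day Python 3 (an assumption: the benchmark itself ran on
Python 2.3). CountingList is a hypothetical stand-in for an array
object, not part of Numeric or numarray; it only shows that every
object[i] in the loop turns into one __getitem__ call:

```python
class CountingList:
    """Hypothetical stand-in for an array object; counts __getitem__ calls."""

    def __init__(self, data):
        self._data = list(data)
        self.calls = 0

    def __getitem__(self, index):
        self.calls += 1
        return self._data[index]


def matrix_loop(obj, niter, n):
    # Same shape as the loop above, with range() instead of xrange().
    for _ in range(niter):
        for i in range(n):
            p = obj[i]


counted = CountingList(range(1000))
matrix_loop(counted, niter=5, n=1000)
print(counted.calls)  # one call per inner iteration: niter * n
```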

The benchmark
=============

In order to get some comparisons done, I've written a small script
(getitem-numarrayVSNumeric.py) that checks the speed of both kinds of
objects: Numeric and numarray. Also, in order to reduce the Python
overhead, I've used psyco [3] so that the results get as close as
possible to what they would be if the tests were running inside a
Python extension (written in C). Moreover, I've used oprofile [4] to
get an idea of where the CPU time is spent in this loop.

First of all, I've made a calibration test to measure the time of the
empty loop, that is:

def null_loop():
    for j in xrange(niter):
        for i in xrange(N):
            pass

This time is almost negligible when running with psyco (and the same
happens inside a C extension), but it is *significant* if psyco is
not active. Once this time has been measured, it is subtracted from
the time of the loops that actually exercise __getitem__.
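
The calibrate-then-subtract procedure can be sketched as follows.
This is NOT the attached script (which is not reproduced here), only
a minimal reconstruction of the idea under stated assumptions: it
uses Python 3, time.perf_counter, a plain list as the container, and
function names of my own choosing.

```python
import time


def null_loop(niter, n):
    # Calibration: same loop structure, but no element access.
    for _ in range(niter):
        for _ in range(n):
            pass


def getitem_loop(obj, niter, n):
    # The loop under test: one __getitem__ call per inner iteration.
    for _ in range(niter):
        for i in range(n):
            obj[i]


def time_call(func, *args):
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def getitem_time_per_iter(obj, niter=5, n=100_000):
    calibration = time_call(null_loop, niter, n)
    total = time_call(getitem_loop, obj, niter, n)
    # Subtract the empty-loop time, then normalise per element access.
    return (total - calibration) / (niter * n)


per_iter = getitem_time_per_iter(list(range(100_000)))
print("Time for list(getitem)/iter:", per_iter)
```

Note that on a noisy machine the difference can be small or even
negative for very cheap containers, so several runs are advisable.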

First (naive) timings
=====================

Now, let's see some of the timings that I've obtained. My platform is
a Pentium 4 @ 2 GHz laptop running Debian GNU/Linux with kernel 2.6.9
and gcc 3.3.5.

First of all, I'll list the results without psyco:

$ python2.3 bench/getitem-numarrayVSNumeric.py
Psyco not active
Numeric version: 23.8
numarray version: 1.2.3
Calibration loop: 0.11173081398
Time for numarray(getitem)/iter: 3.82528972626e-07
Time for Numeric(getitem)/iter: 2.51150989532e-07
getitem in Numeric is 1.52310358537 times faster

We can see that the time per iteration is about 380 ns for numarray
and about 250 ns for Numeric, which makes Numeric 1.5x faster than
numarray.

Using psyco to reduce Python overhead
=====================================

However, even though we have subtracted the time of the calibration
loop, there may remain other places where time is wasted in Python
space. Psyco is a good way to optimize loops and make them go almost
as fast as in C.

Now, the figures using psyco:

$ python2.3 bench/getitem-numarrayVSNumeric.py
Psyco active
Numeric version: 23.8
numarray version: 1.2.3
Calibration loop: 0.0015878200531
Time for numarray(getitem)/iter: 2.4246096611e-07
Time for Numeric(getitem)/iter: 1.19336557388e-07
getitem in Numeric is 2.0317409134 times faster

We can see that the time for the calibration loop has improved by a
factor of about 70x (0.112 s down to 0.0016 s). Not too bad for a
silly loop. Also, the time per iteration has dropped to 242 ns for
numarray and to 119 ns for Numeric, so Numeric is now 2x faster.
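
The ratios for both runs can be recomputed directly from the figures
printed by the script. The snippet below (Python 3, with variable
names of my own) only redoes that arithmetic; it does not rerun the
benchmark:

```python
# Figures copied from the two benchmark runs above (in seconds).
runs = {
    "no psyco": {
        "calibration": 0.11173081398,
        "numarray": 3.82528972626e-07,
        "Numeric": 2.51150989532e-07,
    },
    "psyco": {
        "calibration": 0.0015878200531,
        "numarray": 2.4246096611e-07,
        "Numeric": 1.19336557388e-07,
    },
}

# How many times faster Numeric is than numarray in each run.
speedups = {name: r["numarray"] / r["Numeric"] for name, r in runs.items()}

# How much psyco shortens the empty calibration loop.
calibration_gain = (
    runs["no psyco"]["calibration"] / runs["psyco"]["calibration"]
)

for name, ratio in speedups.items():
    print(f"{name}: getitem in Numeric is {ratio:.2f} times faster")
print(f"calibration loop is {calibration_gain:.1f} times faster with psyco")
```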

The first conclusion is that numarray is considerably slower than
Numeric when accessing its data. Besides, when using psyco, part of
the Python overhead evaporates, which widens the gap between the
Numeric and numarray loops.

Introducing oprofile: getting a broad view of what's going on
=============================================================

In order to measure the exact difference in __getitem__ without the
Python overhead (as in an extension, for example), I've run oprofile
against the psyco version of the benchmark. Here is the result:

# opreport /usr/bin/python2.3
  samples|      %|
------------------
      586 34.1293 libnumarray.so
      454 26.4415 python2.3
      331 19.2778 _numpy.so
      206 11.9977 _ndarray.so
      102  5.9406 memory.so
       22  1.2813 libc-2.3.2.so
        9  0.5242 ld-2.3.2.so
        4  0.2330 multiarray.so
        2  0.1165 _sort.so
        1  0.0582 _psyco.so

The libnumarray.so, _ndarray.so, memory.so and _sort.so shared
libraries all belong to the numarray package, while _numpy.so and
multiarray.so belong to Numeric. The time spent in Python space is
quite small (just 26%, thanks in great part to the psyco
acceleration). libc-2.3.2.so and ld-2.3.2.so belong to the C runtime,
and it is not possible to tell whether their time was used by
numarray, Numeric or Python itself, but as it is very small, we can
safely ignore it.

So, if we sum the samples taken while the CPU was in C space (the
shared libs) for numarray (586 + 206 + 102 = 894, leaving out the
negligible _sort.so) and compare them against the C-space samples for
Numeric (331 in _numpy.so), we get 894 against 331, which means that
Numeric is 2.7x faster than numarray for __getitem__. Of course, this
is more than the 1.5x and 2x factors that we got earlier, because
those included the time spent in Python space. The 2.7x factor is
probably the more accurate one when __getitem__ is exercised from C
extensions.
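
For readers who want to redo the sum, here is a short Python 3 sketch
that groups the opreport samples by package. The grouping follows the
attribution given above; the two tiny entries (_sort.so and
multiarray.so) are left out, as in the 894 vs 331 figures.

```python
# Sample counts copied from the opreport listing above.
samples = {
    "libnumarray.so": 586,
    "python2.3": 454,
    "_numpy.so": 331,
    "_ndarray.so": 206,
    "memory.so": 102,
    "libc-2.3.2.so": 22,
    "ld-2.3.2.so": 9,
    "multiarray.so": 4,
    "_sort.so": 2,
    "_psyco.so": 1,
}

# Libraries counted for each package in the comparison.
numarray_libs = ("libnumarray.so", "_ndarray.so", "memory.so")
numeric_libs = ("_numpy.so",)

numarray_total = sum(samples[lib] for lib in numarray_libs)
numeric_total = sum(samples[lib] for lib in numeric_libs)
ratio = numarray_total / numeric_total

print(numarray_total, numeric_total, round(ratio, 1))
```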

Most CPU intensive functions using oprofile
==========================================

If we want to look at the most time-consuming functions:

# opstack -t 1 /usr/bin/python2.3 | sort -nr| head -10
454      26.6432          python2.3                (no symbols)
331      19.4249          _numpy.so                (no symbols)
145       8.5094          libnumarray.so           NA_getPythonScalar
115       6.7488          libnumarray.so           NA_getByteOffset
101       5.9272          libnumarray.so           isBufferWriteable
98        5.7512          _ndarray.so              _ndarray_subscript
91        5.3404          _ndarray.so              _simpleIndexingCore
73        4.2840          libnumarray.so           NA_updateDataPtr
64        3.7559          memory.so                memory_getbuf
60        3.5211          libnumarray.so           getReadBufferDataPtr

The _numpy.so library was stripped of debugging info, so we can't see
where the time was spent in Numeric. However, we can estimate the
cost of getting a fresh pointer to the data buffer on every data
access in numarray:

isBufferWriteable+NA_updateDataPtr+memory_getbuf+getReadBufferDataPtr

gives a total of 298 samples, which is almost as much as all the time
spent in the Numeric shared library (331). So we can conclude that
having a buffer object in our array object can be a serious drawback
if we want maximum performance when accessing the data.
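
The 298-sample estimate, and how it compares with the totals of the
two packages, can be checked with a few lines of Python 3. The
numarray C-space total used here is the sum of the libnumarray.so,
_ndarray.so and memory.so samples from the opreport listing.

```python
# Samples of the functions involved in refreshing the data pointer,
# copied from the opstack listing above.
buffer_functions = {
    "isBufferWriteable": 101,
    "NA_updateDataPtr": 73,
    "memory_getbuf": 64,
    "getReadBufferDataPtr": 60,
}

buffer_cost = sum(buffer_functions.values())

numeric_total = 331                 # _numpy.so
numarray_c_total = 586 + 206 + 102  # libnumarray.so + _ndarray.so + memory.so

share_of_numeric = buffer_cost / numeric_total
share_of_numarray = buffer_cost / numarray_c_total

print(buffer_cost)
print(f"{share_of_numeric:.0%} of all Numeric samples")
print(f"{share_of_numarray:.0%} of numarray's C-space samples")
```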

Another point worth looking at is NA_getByteOffset, which takes
115 samples by itself. This is perhaps a little too much.

Conclusions
===========

To sum up, we can expect the __getitem__ method in Numeric to be
1.5x faster than in numarray in pure Python code, 2x faster when
using psyco, and 2.7x faster when used from C extensions.

One factor that (partially) explains why numarray is slower in this
area is that it relies on the buffer interface to hold its
data. This feature, while very convenient for certain tasks (like
sharing data with other Python packages or extensions), has a
limitation: an extension that keeps a pointer to the memory buffer
can crash if the buffer is reallocated, so numarray has to fetch a
fresh pointer on every access.
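
The reallocation hazard can be illustrated by analogy in present-day
Python 3. This is an assumption-laden sketch: it uses the revised
buffer protocol of modern CPython, which postdates this report, and
says nothing about how numarray itself behaves. While a memoryview
holds an exported pointer into a bytearray, CPython refuses to resize
the bytearray, because a reallocation would leave that pointer
dangling:

```python
def resize_is_refused_while_exported():
    data = bytearray(8)
    view = memoryview(data)  # exports a pointer into data's memory
    try:
        data.extend(bytes(1024))  # growing would require a reallocation
    except BufferError:
        return True
    finally:
        view.release()
    return False


refused = resize_is_refused_while_exported()
print(refused)

# Once the view has been released, resizing works again.
data = bytearray(8)
with memoryview(data):
    pass
data.extend(bytes(1024))
print(len(data))
```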

Other solutions (like the "bytes" object [5]) have been proposed to
overcome this limitation (and others) of the buffer
interface. Numeric3 might choose it to avoid this kind of
contention problem created by the buffer interface.

Finally, we have seen how oprofile can be of invaluable help in
determining where the hot spots are, not only in our extensions,
but also in other shared libraries on our system. If the shared
libraries also carry debugging info, it is even possible to track
down the most expensive routines in our application.

Appendix
========

Even though it is in the very early stages of existence, I was curious
about how Numeric3 would perform in comparison with Numeric.

By slightly changing getitem-numarrayVSNumeric.py, I've come up with
getitem-NumericVSNumeric3.py, which does the comparison I wanted.

When running without psyco, I got:

$ python2.3 bench/getitem-NumericVSNumeric3.py
Psyco not active
Numeric version: 23.8
Numeric3 version: Very early alpha release...!
Calibration loop: 0.107951593399
Time for Numeric3(getitem)/iter: 1.18472018242e-06
Time for Numeric(getitem)/iter: 2.45458602905e-07
getitem in Numeric is 4.82655799551 times faster

Oops, Numeric3 is almost 5 times slower than Numeric. So it really
seems to be still very much alpha (you know, premature optimization
is the root of all evil).

Never mind, this is just an exercise. So, let's continue with the
psyco version:

$ python2.3 bench/getitem-NumericVSNumeric3.py
Psyco active
Numeric version: 23.8
Numeric3 version: Very early alpha release...!
Calibration loop: 0.00171356201172
Time for Numeric3(getitem)/iter: 1.04013824463e-06
Time for Numeric(getitem)/iter: 1.19578647614e-07
getitem in Numeric is 8.69836099828 times faster

The gap has increased to 8.7x, as expected.

Let's have a look at the most time-consuming shared libs using oprofile:

# opreport /usr/bin/python2.3
  samples|      %|
------------------
     1841 33.7365 multiarray.so
     1701 31.1710 libc-2.3.2.so
     1586 29.0636 python2.3
      318  5.8274 _numpy.so
        6  0.1100 ld-2.3.2.so
        3  0.0550 multiarray.so
        2  0.0367 _psyco.so

God! Two libraries alone take more than half of the CPU time:
multiarray.so and libc-2.3.2.so. As we already know that Numeric3's
__getitem__ takes much more time than its counterpart in Numeric, we
can conclude that Numeric3 comes with its own multiarray.so, and that
it is responsible for one third (33.7%) of the time. Moreover,
multiarray.so must be the one calling the libc routines so much,
because in our previous benchmark the libc calls never took more than
5% of the time, and here they take more than 30%.

To conclude, let's see which are the most time-consuming routines in
Numeric3 for this exercise:

# opstack -t 1 /usr/bin/python2.3 | sort -nr| head -20
1586     30.1750    python2.3                (no symbols)
669      12.7283    libc-2.3.2.so            __GI___strcasecmp
618      11.7580    multiarray.so            PyArray_MapIterNew
374       7.1157    multiarray.so            array_subscript
318       6.0502    _numpy.so                (no symbols)
260       4.9467    libc-2.3.2.so            __realloc
190       3.6149    libc-2.3.2.so            _int_malloc
172       3.2725    multiarray.so            PyArray_New
152       2.8919    libc-2.3.2.so            __strncasecmp
123       2.3402    libc-2.3.2.so            malloc_consolidate
121       2.3021    libc-2.3.2.so            __memalign_internal
118       2.2451    multiarray.so            array_dealloc
102       1.9406    libc-2.3.2.so            _int_realloc
93        1.7694    multiarray.so            fancy_indexing_check
86        1.6362    multiarray.so            arraymapiter_dealloc
79        1.5030    multiarray.so            PyArray_Scalar
76        1.4460    multiarray.so            LONG_copyswapn
62        1.1796    multiarray.so            PyArray_UpdateFlags
57        1.0845    multiarray.so            PyArray_DescrFromType

While we can see that a lot of time is spent inside Numeric3's
multiarray.so, it also catches our attention that a lot of time is
spent in the __GI___strcasecmp libc function. This is very strange,
because our arrays are made of integers, and calling strcasecmp on
each iteration seems quite unnecessary.
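
To see how the libc time in the listing splits up, the samples can be
grouped by kind of routine. In the Python 3 sketch below, treating
every libc symbol that is not a case-insensitive string comparison as
part of the memory allocator is my own reading of the symbol names,
not something oprofile reports:

```python
# libc entries copied from the opstack listing above.
libc_samples = {
    "__GI___strcasecmp": 669,
    "__realloc": 260,
    "_int_malloc": 190,
    "__strncasecmp": 152,
    "malloc_consolidate": 123,
    "__memalign_internal": 121,
    "_int_realloc": 102,
}


def is_string_compare(symbol):
    # strcasecmp and strncasecmp both contain "casecmp".
    return "casecmp" in symbol


string_compare = sum(
    n for sym, n in libc_samples.items() if is_string_compare(sym)
)
# Assumption: the remaining listed symbols are all allocator routines.
allocation = sum(
    n for sym, n in libc_samples.items() if not is_string_compare(sym)
)

print("string comparison samples:", string_compare)
print("memory allocation samples:", allocation)
```

So, among the listed libc routines, the samples divide almost evenly
between string comparisons and memory allocation.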

In order to know who is calling strcasecmp (i.e. to get the call
tree), oprofile needs a specially patched version of the Linux
kernel. But this is material for another story.

References
==========

[1] http://numpy.sourceforge.net/
[2] http://stsdas.stsci.edu/numarray/
[3] http://psyco.sourceforge.net/
[4] http://oprofile.sourceforge.net/
[5] http://www.python.org/peps/pep-0296.html

-------------- next part --------------
A non-text attachment was scrubbed...
Name: getitem-numarrayVSNumeric.py
Type: application/x-python
Size: 1280 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20050329/b058cb9f/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: getitem-NumericVSNumeric3.py
Type: application/x-python
Size: 1281 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20050329/b058cb9f/attachment-0003.bin>