[Numpy-discussion] Profiling numpy ? (parts written in C)

Francesc Altet faltet at carabos.com
Tue Dec 19 09:03:30 EST 2006


On Tuesday 19 December 2006 08:12, David Cournapeau wrote:
> Hi,
>
>    Following the discussion on clip and other functions which *may* be
> slow in numpy, I would like to know if there is a way to easily profile
> numpy, ie functions which are written in C.
>    For example, I am not sure to understand why a function like take(a,
> b) with a a double 256x4 array and b a 8000x256 int array takes almost
> 200 ms on a fairly fast CPU; in the source code, I can see that numpy
> uses memmove, and I know memmove to be slower than memcpy. Is there an
> easy way to check that this is coming from memmove (case in which
> nothing much can be done to improve the situation I guess), and not from
> something else ?
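
(For reference, here is a minimal sketch to reproduce the take() timing you
describe; the shapes come from your message, and the index values are just
arbitrary valid flat indices:)

------------------------------------------------------------------
import timeit
import numpy as N

a = N.random.randn(256, 4)                          # double 256x4 array
b = N.random.randint(0, a.size, size=(8000, 256))   # int 8000x256 index array

# Time 10 take() calls; without an axis, take() works on the flattened array.
t = timeit.Timer('N.take(a, b)', 'from __main__ import N, a, b')
print t.timeit(number=10) / 10, "seconds per call"
------------------------------------------------------------------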

For profiling C extensions, you can use cProfile, which is included in
Python 2.5. You can find an example of your benchmark run under
cProfile at the end of this message.

I've run it against clip1_bench and clip2_bench. Here are my results
(using a Pentium4 Mobile @ 2 GHz).

For clip1 (i.e. clip from numpy):

         17 function calls in 5.131 CPU seconds
   Ordered by: internal time, call count
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    4.638    0.464    4.638    0.464 {method 'clip' of 'numpy.ndarray' objects}
        1    0.453    0.453    0.453    0.453 {method 'randn' of 'mtrand.RandomState' objects}
        1    0.020    0.020    4.658    4.658 clipb2.py:16(clip1_bench)
        1    0.002    0.002    5.113    5.113 clipb2.py:10(bench_clip)
        1    0.002    0.002    5.115    5.115 <string>:1(<module>)
        1    0.000    0.000    0.453    0.453 clipb2.py:6(generate_data_2d)
        1    0.000    0.000    0.000    0.000 {range}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

[you can see the C extensions between curly brackets]

For clip2 (i.e. your hand-made clip equivalent):

         17 function calls in 3.371 CPU seconds
   Ordered by: internal time, call count
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.485    2.485    2.911    2.911 clipb2.py:19(clip2_bench)
        1    0.456    0.456    0.456    0.456 {method 'randn' of 'mtrand.RandomState' objects}
       10    0.426    0.043    0.426    0.043 {method 'copy' of 'numpy.ndarray' objects}
        1    0.003    0.003    3.369    3.369 clipb2.py:10(bench_clip)
        1    0.002    0.002    3.371    3.371 <string>:1(<module>)
        1    0.000    0.000    0.456    0.456 clipb2.py:6(generate_data_2d)
        1    0.000    0.000    0.000    0.000 {range}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

From these timings, one can see that most of the time in clip1 is
spent in the .clip() method of numpy (nothing really new).

However, cProfile only shows where the time is spent in the first-level
calls into the extension. If we want more introspection into the C
stack, and you are running on Linux, oprofile
(http://oprofile.sourceforge.net) is a very nice profiler. Here are
its outputs for the above routines on my machine.

For clip1:

Profiling through timer interrupt
samples  %        image name               symbol name
643      54.6769  libc-2.3.6.so            memmove
151      12.8401  multiarray.so            PyArray_Choose
35        2.9762  umath.so                 BYTE_multiply
34        2.8912  umath.so                 DOUBLE_greater
32        2.7211  mtrand.so                rk_random
32        2.7211  umath.so                 DOUBLE_less
30        2.5510  libc-2.3.6.so            memcpy


For clip2:

Profiling through timer interrupt
samples  %        image name               symbol name
188      24.5111  libc-2.3.6.so            memmove
143      18.6441  multiarray.so            _nonzero_indices
126      16.4276  multiarray.so            PyArray_MapIterNext
37        4.8240  umath.so                 DOUBLE_greater
36        4.6936  mtrand.so                rk_gauss
33        4.3025  umath.so                 DOUBLE_less
24        3.1291  libc-2.3.6.so            memcpy

So, it seems you are right: the bottleneck is the call to the memmove
routine. Looking at the code in PyArray_Choose
(multiarraymodule.c), I've replaced the memmove call with a memcpy one.
Here is the patch:

--- numpy/core/src/multiarraymodule.c   (revision 3487)
+++ numpy/core/src/multiarraymodule.c   (working copy)
@@ -2126,7 +2126,7 @@
                 }
                 offset = i*elsize;
                 if (offset >= sizes[mi]) {offset = offset % sizes[mi]; }
-               memmove(ret_data, mps[mi]->data+offset, elsize);
+               memcpy(ret_data, mps[mi]->data+offset, elsize);
                ret_data += elsize; self_data++;
        }

With this patch applied, we have, for clip1:

Profiling through timer interrupt
samples  %        image name               symbol name
659      55.2389  libc-2.3.6.so            memcpy
184      15.4233  multiarray.so            PyArray_Choose
46        3.8558  mtrand.so                rk_gauss
37        3.1014  umath.so                 BYTE_multiply
34        2.8500  umath.so                 DOUBLE_greater
34        2.8500  umath.so                 DOUBLE_less
24        2.0117  libm-2.3.6.so            __ieee754_log

So, it seems clear that the use of memcpy hasn't accelerated the
computations at all. This is somewhat striking, because in most
situations, memcpy should perform better (see [1] for a practical
example of this).

My guess is that the real bottleneck is calling memmove so many times
(once per element in the array). Perhaps the algorithm could be changed
to do a block copy at the beginning and then modify only the places
where the clip should act (much the same as what you did in Python, but
at the C level).
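
Here is a rough Python-level sketch of that idea (just to illustrate the
algorithm; the real fix would have to go into the C code). N.putmask
touches only the out-of-range elements after a single block copy:

------------------------------------------------------------------
import numpy as N

def clip_block(data, m, M):
    out = data.copy()             # one block copy of the whole array
    N.putmask(out, out < m, m)    # overwrite only the elements below m
    N.putmask(out, out > M, M)    # overwrite only the elements above M
    return out
------------------------------------------------------------------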

[1] <http://thread.gmane.org/gmane.comp.python.numeric.general/4982/focus=5058>

Cheers,

------------------------------------------------------------------
# Example of clip using cProfile
import numpy as N

#============================
# Generate the 2-D test data
#============================
def generate_data_2d(fr, nwin, hop, length):
    nframes = int(1.0 * fr / hop * length)
    return N.random.randn(nframes, nwin)

def bench_clip():
    m   = -1.
    M   = 1.
    # 2 minutes (120 s) of sound @ 8 kHz, 256-sample frames with 50 % overlap
    data    = generate_data_2d(8000, 256, 128, 120)

    def clip1_bench(data, niter):
        for i in range(niter):
            blop = data.clip(m, M)
    def clip2_bench(data, niter):
        for i in range(niter):
            blop = data.copy()
            blop[blop<m]    = m
            blop[blop>M]    = M

    # Profile one variant at a time: swap the comment below to measure clip2.
    #clip2_bench(data, 10)
    clip1_bench(data, 10)

if __name__ == '__main__':
    # test clip
    import pstats
    import cProfile as prof

    profile_wanted = False   # set to True to get the cProfile output shown above
    if not profile_wanted:
        bench_clip()
    else:
        profile_file = 'clip.prof'
        prof.run('bench_clip()', profile_file)
        stats = pstats.Stats(profile_file)
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        stats.print_stats()
-------------------------------------------------------------------

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"



