[Numpy-discussion] Profiling numpy ? (parts written in C)
Francesc Altet
faltet at carabos.com
Tue Dec 19 09:03:30 EST 2006
On Tuesday 19 December 2006 08:12, David Cournapeau wrote:
> Hi,
>
> Following the discussion on clip and other functions which *may* be
> slow in numpy, I would like to know if there is a way to easily profile
> numpy, i.e. the functions which are written in C.
> For example, I am not sure I understand why a function like take(a,
> b), with a a double 256x4 array and b an 8000x256 int array, takes
> almost 200 ms on a fairly fast CPU; in the source code, I can see that
> numpy uses memmove, and I know memmove to be slower than memcpy. Is
> there an easy way to check that the time is coming from memmove (in
> which case nothing much can be done to improve the situation, I guess),
> and not from something else?
For profiling C extensions, you can use cProfile, which is included in
Python 2.5. See an example of your benchmark run under cProfile at the
end of this message.
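As a minimal standalone illustration (array sizes here are made up, not the ones from the benchmark), cProfile reports time spent inside C-level calls as `{method ...}` entries:

```python
import cProfile
import io
import pstats

import numpy as np

def work():
    # hypothetical sizes, just large enough for measurable timings
    a = np.random.randn(2000, 256)
    for _ in range(10):
        a.clip(-1.0, 1.0)

prof = cProfile.Profile()
prof.enable()
work()
prof.disable()

# C-level calls show up as {method ...} entries in the report
s = io.StringIO()
pstats.Stats(prof, stream=s).sort_stats('tottime').print_stats(10)
print(s.getvalue())
```

Note that cProfile attributes all the time of a C call to the call itself; it cannot look deeper into the C stack.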
I've run it against clip1_bench and clip2_bench. Here are my results
(using a Pentium4 Mobile @ 2 GHz).
For clip1 (i.e. clip from numpy):
17 function calls in 5.131 CPU seconds

Ordered by: internal time, call count

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    10    4.638    0.464    4.638    0.464  {method 'clip' of 'numpy.ndarray' objects}
     1    0.453    0.453    0.453    0.453  {method 'randn' of 'mtrand.RandomState' objects}
     1    0.020    0.020    4.658    4.658  clipb2.py:16(clip1_bench)
     1    0.002    0.002    5.113    5.113  clipb2.py:10(bench_clip)
     1    0.002    0.002    5.115    5.115  <string>:1(<module>)
     1    0.000    0.000    0.453    0.453  clipb2.py:6(generate_data_2d)
     1    0.000    0.000    0.000    0.000  {range}
     1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
[you can see the C-level calls between curly brackets]
For clip2 (i.e. your hand-made clip equivalent):
17 function calls in 3.371 CPU seconds

Ordered by: internal time, call count

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    2.485    2.485    2.911    2.911  clipb2.py:19(clip2_bench)
     1    0.456    0.456    0.456    0.456  {method 'randn' of 'mtrand.RandomState' objects}
    10    0.426    0.043    0.426    0.043  {method 'copy' of 'numpy.ndarray' objects}
     1    0.003    0.003    3.369    3.369  clipb2.py:10(bench_clip)
     1    0.002    0.002    3.371    3.371  <string>:1(<module>)
     1    0.000    0.000    0.456    0.456  clipb2.py:6(generate_data_2d)
     1    0.000    0.000    0.000    0.000  {range}
     1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
From these timings, one can see that most of the time in clip1 is
spent in the .clip() method of numpy (nothing really new).
So, cProfile only shows where the time is spent at the level of the
first call into the extension. If you want more introspection into the
C stack, and you are running on Linux, oprofile
(http://oprofile.sourceforge.net) is a very nice profiler. Here are
the outputs for the above routines on my machine.
For clip1:
Profiling through timer interrupt
samples % image name symbol name
643 54.6769 libc-2.3.6.so memmove
151 12.8401 multiarray.so PyArray_Choose
35 2.9762 umath.so BYTE_multiply
34 2.8912 umath.so DOUBLE_greater
32 2.7211 mtrand.so rk_random
32 2.7211 umath.so DOUBLE_less
30 2.5510 libc-2.3.6.so memcpy
For clip2:
Profiling through timer interrupt
samples % image name symbol name
188 24.5111 libc-2.3.6.so memmove
143 18.6441 multiarray.so _nonzero_indices
126 16.4276 multiarray.so PyArray_MapIterNext
37 4.8240 umath.so DOUBLE_greater
36 4.6936 mtrand.so rk_gauss
33 4.3025 umath.so DOUBLE_less
24 3.1291 libc-2.3.6.so memcpy
So, it seems like you are right: the bottleneck is in calling the
memmove routine. Looking at the code of PyArray_Choose (in
multiarraymodule.c), I've replaced the memmove call with a memcpy one.
Here is the patch:
--- numpy/core/src/multiarraymodule.c   (revision 3487)
+++ numpy/core/src/multiarraymodule.c   (working copy)
@@ -2126,7 +2126,7 @@
         }
         offset = i*elsize;
         if (offset >= sizes[mi]) {offset = offset % sizes[mi]; }
-        memmove(ret_data, mps[mi]->data+offset, elsize);
+        memcpy(ret_data, mps[mi]->data+offset, elsize);
         ret_data += elsize; self_data++;
     }
With this patch applied, we have, for clip1:
Profiling through timer interrupt
samples % image name symbol name
659 55.2389 libc-2.3.6.so memcpy
184 15.4233 multiarray.so PyArray_Choose
46 3.8558 mtrand.so rk_gauss
37 3.1014 umath.so BYTE_multiply
34 2.8500 umath.so DOUBLE_greater
34 2.8500 umath.so DOUBLE_less
24 2.0117 libm-2.3.6.so __ieee754_log
So, it seems clear that the use of memcpy hasn't accelerated the
computations at all. This is somewhat striking, because in most
situations, memcpy should perform better (see [1] for a practical
example of this).
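The per-call-overhead hypothesis can be illustrated from Python with ctypes (a rough sketch of the effect, not the actual C code path; the array size is made up): copying the same buffer with one bulk memmove versus one memmove per element.

```python
import ctypes
import timeit

import numpy as np

a = np.random.randn(1000, 256)   # hypothetical size
dst = np.empty_like(a)
n, elsize = a.size, a.itemsize

# one bulk copy of the whole buffer
bulk = timeit.timeit(
    lambda: ctypes.memmove(dst.ctypes.data, a.ctypes.data, n * elsize),
    number=10)

# the same bytes moved one element at a time, which is what the
# PyArray_Choose inner loop effectively does
def per_element():
    src, d = a.ctypes.data, dst.ctypes.data
    for i in range(n):
        ctypes.memmove(d + i * elsize, src + i * elsize, elsize)

per_elem = timeit.timeit(per_element, number=1)
print(bulk, per_elem)
```

Most of the gap here is of course Python-level call overhead rather than libc overhead, but the qualitative point carries over: when the copy is done per element, the fixed cost of each call dominates, and memmove vs. memcpy makes little difference.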
My guess is that the real bottleneck is calling memmove so many times
(once per element of the array). Perhaps the algorithm can be changed
to do a block copy at the beginning and then modify only the places on
which the clip should act (much the same as what you have done in
Python, but at the C level).
[1]<http://thread.gmane.org/gmane.comp.python.numeric.general/4982/focus=5058>
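In NumPy terms, that suggested strategy (one bulk copy up front, then touch only the out-of-range elements) is essentially what clip2 does; a small sketch that checks it against ndarray.clip (the function name and array size here are made up for illustration):

```python
import numpy as np

def clip_blockcopy(a, lo, hi):
    # one bulk copy up front (a single memcpy-like operation)...
    out = a.copy()
    # ...then modify only the elements the clip actually affects
    out[out < lo] = lo
    out[out > hi] = hi
    return out

a = np.random.randn(500, 4)   # hypothetical size
assert np.array_equal(clip_blockcopy(a, -0.5, 0.5), a.clip(-0.5, 0.5))
```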
Cheers,
------------------------------------------------------------------
# Example of clip using cProfile
import numpy as N

#========================
# To benchmark clip alone
#========================
def generate_data_2d(fr, nwin, hop, length):
    nframes = int(1.0 * fr / hop * length)
    return N.random.randn(nframes, nwin)

def bench_clip():
    m = -1.
    M = 1.
    # 2 minutes (120 sec) of sound @ 8 kHz, 256-sample windows, 50 % overlap
    data = generate_data_2d(8000, 256, 128, 120)
    def clip1_bench(data, niter):
        for i in range(niter):
            blop = data.clip(m, M)
    def clip2_bench(data, niter):
        for i in range(niter):
            blop = data.copy()
            blop[blop<m] = m
            blop[blop>M] = M
    #clip2_bench(data, 10)
    clip1_bench(data, 10)

if __name__ == '__main__':
    # test clip
    import pstats
    import cProfile as prof
    profile_wanted = False
    if not profile_wanted:
        bench_clip()
    else:
        profile_file = 'clip.prof'
        prof.run('bench_clip()', profile_file)
        stats = pstats.Stats(profile_file)
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        stats.print_stats()
-------------------------------------------------------------------
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"