[Numpy-discussion] Speeding up Numeric
Todd Miller
jmiller at stsci.edu
Fri Jan 28 15:02:18 EST 2005
Nice work!
But... IIRC, there's a problem with moving __del__ down to C, possibly
only for a --with-pydebug Python, I can't remember. It's a serious
problem though... it dumps core. I'll see if I can come up with
something conditionally compiled.
Related note to Make Todd's Life Easy: use "cvs diff -c" to make
context diffs which "patch" applies effortlessly.
Thanks for getting the ball rolling. 2x is nothing to sneeze at.
Todd
On Fri, 2005-01-28 at 17:27, Francesc Altet wrote:
> Hi Todd,
>
> Nice to see that you achieved a good speed-up with your
> optimization patch. With the following code:
>
> import numarray
> a = numarray.arange(2000)
> a.shape = (1000, 2)
> for j in xrange(1000):
>     for i in range(len(a)):
>         row = a[i]
>
> and the original numarray-1.1.1, it took 11.254s (Pentium 4 @ 2 GHz).
> With your patch, this time is reduced to 7.816s. Now, following your
> suggestion to push NumArray.__del__ down into C, I've got a good
> speed-up as well: 5.332s. This is more than twice as fast as the
> unpatched numarray 1.1.1. There is still a long way to go before we can
> catch Numeric (1.123s), but it is a first step :)
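For readers who want to reproduce this style of measurement, here is a minimal timing harness in the same spirit, using only the standard library; a plain list of rows stands in for the numarray array (names and numbers are illustrative, so the absolute timings are not comparable to the ones quoted above, only the method is):

```python
# Minimal sketch of the benchmark's timing loop using only the stdlib.
# A list of [even, odd] pairs stands in for numarray.arange(2000)
# reshaped to (1000, 2); only the per-row indexing pattern is the same.
import time

def bench(n_rows=1000, n_reps=1000):
    a = [[2 * i, 2 * i + 1] for i in range(n_rows)]
    t0 = time.perf_counter()
    for _ in range(n_reps):
        for i in range(len(a)):
            row = a[i]          # per-row indexing, as in the post
    return time.perf_counter() - t0

elapsed = bench()
print("elapsed: %.3fs" % elapsed)
```

The same harness, pointed at the real array type, is what produces the 11.254s / 7.816s / 5.332s figures quoted in the thread.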
>
> Here is the patch. Please revise it, as I'm not very used to dealing
> with pure C extensions (I'm just a Pyrex user):
>
> Index: Lib/numarraycore.py
> ===================================================================
> RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
> retrieving revision 1.101
> diff -r1.101 numarraycore.py
> 696,699c696,699
> <     def __del__(self):
> <         if self._shadows != None:
> <             self._shadows._copyFrom(self)
> <             self._shadows = None
> ---
> >     def __del__(self):
> >         if self._shadows != None:
> >             self._shadows._copyFrom(self)
> >             self._shadows = None
> Index: Src/_numarraymodule.c
> ===================================================================
> RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
> retrieving revision 1.65
> diff -r1.65 _numarraymodule.c
> 399a400,411
> > static void
> > _numarray_dealloc(PyObject *self)
> > {
> >     PyArrayObject *selfa = (PyArrayObject *) self;
> >
> >     if (selfa->_shadows != NULL) {
> >         _copyFrom(selfa->_shadows, self);
> >         selfa->_shadows = NULL;
> >     }
> >     self->ob_type->tp_free(self);
> > }
> >
> 421c433
> <     0,                              /* tp_dealloc */
> ---
> >     _numarray_dealloc,              /* tp_dealloc */
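The reason pushing __del__ into tp_dealloc pays off is that a Python-level __del__ re-enters the interpreter on every object destruction, while a C-level dealloc does not. A rough, generic micro-benchmark of that per-object callback cost (an illustration only, not numarray code):

```python
# Compare destruction cost of objects with and without a Python __del__.
# Each cls() here is created and immediately destroyed, so the loop
# mostly measures allocation + deallocation overhead.
import time

class NoDel:
    pass

class WithDel:
    def __del__(self):
        pass               # even an empty __del__ forces a callback

def churn(cls, n=200000):
    t0 = time.perf_counter()
    for _ in range(n):
        cls()
    return time.perf_counter() - t0

t_plain = churn(NoDel)
t_del = churn(WithDel)
print("no __del__: %.3fs   with __del__: %.3fs" % (t_plain, t_del))
```

On most builds the `WithDel` loop is noticeably slower, which is the overhead the C-level `_numarray_dealloc` above removes.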
>
>
> The profile with the new optimizations looks now like:
>
> samples % image name symbol name
> 453 8.6319 python PyEval_EvalFrame
> 372 7.0884 python lookdict_string
> 349 6.6502 python string_hash
> 271 5.1639 libc-2.3.2.so _wordcopy_bwd_aligned
> 210 4.0015 libnumarray.so NA_updateStatus
> 194 3.6966 python _PyString_Eq
> 185 3.5252 libc-2.3.2.so __GI___strcasecmp
> 162 3.0869 python subtype_dealloc
> 158 3.0107 libc-2.3.2.so _int_malloc
> 147 2.8011 libnumarray.so isBufferWriteable
> 145 2.7630 python PyDict_SetItem
> 135 2.5724 _ndarray.so _view
> 131 2.4962 python PyObject_GenericGetAttr
> 122 2.3247 python PyDict_GetItem
> 100 1.9055 python PyString_InternInPlace
> 94 1.7912 libnumarray.so getReadBufferDataPtr
> 77 1.4672 _ndarray.so _simpleIndexingCore
>
> i.e. time spent in libc and libnumarray is moving up the list, as it
> should. Now we have to concentrate on other points of optimization.
> Perhaps it is a good time to try recompiling the kernel and
> getting the call tree...
>
> Cheers,
>
> A Divendres 28 Gener 2005 12:48, Todd Miller va escriure:
> > I got some insight into what I think is the tall pole in the profile:
> > sub-array creation is implemented using views. The generic indexing
> > code does a view() Python callback because object arrays override
> > view(). Faster view() creation for numerical arrays can be achieved
> > like this by avoiding the callback:
> >
> > Index: Src/_ndarraymodule.c
> > ===================================================================
> > RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
> > retrieving revision 1.75
> > diff -c -r1.75 _ndarraymodule.c
> > *** Src/_ndarraymodule.c 14 Jan 2005 14:13:22 -0000 1.75
> > --- Src/_ndarraymodule.c 28 Jan 2005 11:15:50 -0000
> > ***************
> > *** 453,460 ****
> >           }
> >       } else {  /* partially subscripted --> subarray */
> >           long i;
> > !         result = (PyArrayObject *)
> > !             PyObject_CallMethod((PyObject *) self, "view", NULL);
> >           if (!result) goto _exit;
> >
> >           result->nd = result->nstrides = self->nd - nindices;
> > --- 453,463 ----
> >           }
> >       } else {  /* partially subscripted --> subarray */
> >           long i;
> > !         if (NA_NumArrayCheck((PyObject *)self))
> > !             result = _view(self);
> > !         else
> > !             result = (PyArrayObject *) PyObject_CallMethod(
> > !                 (PyObject *) self, "view", NULL);
> >           if (!result) goto _exit;
> >
> >           result->nd = result->nstrides = self->nd - nindices;
> >
> > I committed the patch above to CVS for now. This optimization makes
> > view() "non-overridable" for NumArray subclasses, so there is probably
> > a better way of doing this.
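The trade-off described here can be sketched in plain Python (the class and method names below are illustrative, not numarray's actual code): a fast path that calls the base implementation directly is cheaper but silently bypasses any subclass override of view().

```python
# Sketch: fast-path dispatch vs. overridable-method dispatch.
class Base:
    def view(self):
        return "base view"

    def _fast_view(self):
        # What a C-level shortcut effectively does: call the base
        # implementation directly, skipping the method lookup.
        return Base.view(self)

    def index_slow(self):
        return self.view()        # honors subclass overrides (callback)

    def index_fast(self):
        return self._fast_view()  # faster, but not overridable

class Sub(Base):
    def view(self):
        return "custom view"

s = Sub()
print(s.index_slow())   # override respected
print(s.index_fast())   # override bypassed
```

The slow path's flexibility is exactly what the committed patch gives up for NumArray itself, which is why a subclass-aware check (like the NA_NumArrayCheck guard) is needed.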
> >
> > One other thing that struck me looking at your profile, and it has
> > been discussed before, is that NumArray.__del__() needs to be pushed
> > (back) down into C. Getting rid of __del__ would also synergize well
> > with making an object freelist, one aspect of which is capturing
> > unneeded objects rather than destroying them.
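A freelist in this sense could look roughly like the following generic sketch (not numarray's actual design; all names here are hypothetical): released objects are parked on a list and handed back out, so the allocator and deallocator are skipped entirely on the hot path.

```python
# Generic object-freelist pattern: recycle instead of destroy.
class ArrayLike:
    def __init__(self):
        self.data = None

_freelist = []

def acquire():
    # Reuse a parked object if one exists, otherwise allocate.
    return _freelist.pop() if _freelist else ArrayLike()

def release(obj):
    obj.data = None          # reset state before recycling
    _freelist.append(obj)    # park it instead of letting it be freed

a = acquire()
release(a)
b = acquire()
print(a is b)                # the recycled object was handed back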
> >
> > Thanks for the profile.
> >
> > Regards,
> > Todd
> >
> > On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
> > > Hi,
> > >
> > > After waiting a while for some free time, I'm now playing with the
> > > excellent oprofile, trying to help reduce numarray creation time.
> > >
> > > For that goal, I selected this small benchmark:
> > >
> > > import numarray
> > > a = numarray.arange(2000)
> > > a.shape = (1000, 2)
> > > for j in xrange(1000):
> > >     for i in range(len(a)):
> > >         row = a[i]
> > >
> > > I know that it mixes creation cost with indexing cost, but as
> > > numarray's indexing is only a bit slower (perhaps 40%) than
> > > Numeric's, while array creation is 5 to 10 times slower, I think
> > > this benchmark provides a good starting point to see what's going
> > > on.
> > >
> > > For numarray, I've got the next results:
> > >
> > > samples % image name symbol name
> > > 902 7.3238 python PyEval_EvalFrame
> > > 835 6.7798 python lookdict_string
> > > 408 3.3128 python PyObject_GenericGetAttr
> > > 384 3.1179 python PyDict_GetItem
> > > 383 3.1098 libc-2.3.2.so memcpy
> > > 358 2.9068 libpthread-0.10.so __pthread_alt_unlock
> > > 293 2.3790 python _PyString_Eq
> > > 273 2.2166 libnumarray.so NA_updateStatus
> > > 273 2.2166 python PyType_IsSubtype
> > > 271 2.2004 python countformat
> > > 252 2.0461 libc-2.3.2.so memset
> > > 249 2.0218 python string_hash
> > > 248 2.0136 _ndarray.so _universalIndexing
> > >
> > > while for Numeric I've got this:
> > >
> > > samples % image name symbol name
> > > 279 15.6478 libpthread-0.10.so __pthread_alt_unlock
> > > 216 12.1144 libc-2.3.2.so memmove
> > > 187 10.4879 python lookdict_string
> > > 162 9.0858 python PyEval_EvalFrame
> > > 144 8.0763 libpthread-0.10.so __pthread_alt_lock
> > > 126 7.0667 libpthread-0.10.so __pthread_alt_trylock
> > > 56 3.1408 python PyDict_SetItem
> > > 53 2.9725 libpthread-0.10.so __GI___pthread_mutex_unlock
> > > 45 2.5238 _numpy.so PyArray_FromDimsAndDataAndDescr
> > > 39 2.1873 libc-2.3.2.so __malloc
> > > 36 2.0191 libc-2.3.2.so __cfree
> > >
> > > one preliminary result is that numarray spends a lot more time in
> > > Python space than Numeric does, as Todd already said here. The
> > > problem is that, as I have not yet patched my kernel, I can't get
> > > the call tree, so I can't track down what is ultimately responsible
> > > for that.
> > >
> > > So, I've tried running the profile module included in the standard
> > > library to see which are the hot spots in Python:
> > >
> > > $ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
> > > 1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
> > >
> > > Ordered by: internal time
> > >
> > > ncalls  tottime percall cumtime percall filename:lineno(function)
> > >      1  19.220  19.220  25.290  25.290 create-numarray.py:1(?)
> > > 999999   5.530   0.000   5.530   0.000 numarraycore.py:514(__del__)
> > >   1753   0.160   0.000   0.160   0.000 :0(eval)
> > >      1   0.060   0.060   0.340   0.340 numarraycore.py:3(?)
> > >      1   0.050   0.050   0.390   0.390 generic.py:8(?)
> > >      1   0.040   0.040   0.490   0.490 numarrayall.py:1(?)
> > >   3455   0.040   0.000   0.040   0.000 :0(len)
> > >      1   0.030   0.030   0.190   0.190 ufunc.py:1504(_makeCUFuncDict)
> > >     51   0.030   0.001   0.070   0.001 ufunc.py:184(_nIOArgs)
> > >   3572   0.030   0.000   0.030   0.000 :0(has_key)
> > >   2582   0.020   0.000   0.020   0.000 :0(append)
> > >   1000   0.020   0.000   0.020   0.000 :0(range)
> > >      1   0.010   0.010   0.010   0.010 generic.py:510(_stridesFromShape)
> > >   42/1   0.010   0.000  25.290  25.290 <string>:1(?)
> > >
> > > but, to tell the truth, I can't really see where the time is
> > > actually consumed. Perhaps somebody with more experience can shed
> > > more light on this?
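One way to dig further, as asked here, is to sort the profile by cumulative time so that callers appear above their callees, which makes it clearer who is responsible for a hot leaf like __del__. A small self-contained sketch using the modern cProfile/pstats pair (the post used the older profile module; the workload below is a stand-in for the benchmark):

```python
# Profile a nested loop and report the top entries by cumulative time.
import cProfile
import io
import pstats

def inner():
    rows = [[i, i + 1] for i in range(1000)]
    for i in range(len(rows)):
        row = rows[i]

def outer():
    for _ in range(100):
        inner()

prof = cProfile.Profile()
prof.runcall(outer)

buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf).sort_stats("cumulative")
stats.print_stats(5)          # top 5 entries; callers rank above callees

summary = next(line for line in buf.getvalue().splitlines() if line.strip())
print(summary)
```

Sorting by "cumulative" (rather than `-s time` as in the post) attributes the __del__ cost upward to the code creating the temporaries, which is a poor man's substitute for the kernel-assisted call tree mentioned above.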
> > >
> > > Another thing that I find intriguing has to do with the Numeric
> > > oprofile output. Let me recall it:
> > >
> > > samples % image name symbol name
> > > 279 15.6478 libpthread-0.10.so __pthread_alt_unlock
> > > 216 12.1144 libc-2.3.2.so memmove
> > > 187 10.4879 python lookdict_string
> > > 162 9.0858 python PyEval_EvalFrame
> > > 144 8.0763 libpthread-0.10.so __pthread_alt_lock
> > > 126 7.0667 libpthread-0.10.so __pthread_alt_trylock
> > > 56 3.1408 python PyDict_SetItem
> > > 53 2.9725 libpthread-0.10.so __GI___pthread_mutex_unlock
> > > 45 2.5238 _numpy.so PyArray_FromDimsAndDataAndDescr
> > > 39 2.1873 libc-2.3.2.so __malloc
> > > 36 2.0191 libc-2.3.2.so __cfree
> > >
> > > we can see that a lot of the time in the Numeric benchmark is spent
> > > in libc space (37% or so). However, only about 16% goes to
> > > memory-related tasks (memmove, malloc and free), while the rest
> > > seems to go to threading issues (??). Again, can anyone explain why
> > > the pthread* routines take so much time, or why they appear here at
> > > all? Perhaps getting rid of these calls might improve Numeric's
> > > performance even further.
> > >
> > > Cheers,
> >
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/numpy-discussion