[Numpy-discussion] Speeding up Numeric
Todd Miller
jmiller at stsci.edu
Fri Jan 28 15:02:18 EST 2005
Nice work!
But... IIRC, there's a problem with moving __del__ down to C, possibly
only for a --with-pydebug Python, I can't remember. It's a serious
problem though... it dumps core. I'll see if I can come up with
something conditionally compiled.
Related note to Make Todd's Life Easy: use "cvs diff -c" to make
context diffs which "patch" applies effortlessly.
Thanks for getting the ball rolling. 2x is nothing to sneeze at.
Todd
On Fri, 2005-01-28 at 17:27, Francesc Altet wrote:
> Hi Todd,
>
> Nice to see that you achieved a good speed-up with your
> optimization patch. With the following code:
>
> import numarray
> a = numarray.arange(2000)
> a.shape = (1000, 2)
> for j in xrange(1000):
>     for i in range(len(a)):
>         row = a[i]
>
> and the original numarray-1.1.1, it took 11.254s (Pentium 4 @ 2 GHz).
> With your patch, this time is reduced to 7.816s. Now, following your
> suggestion to push NumArray.__del__ down into C, I've got a good
> speed-up as well: 5.332s. This is more than twice as fast as the
> unpatched numarray 1.1.1. There is still a long way to go before we can
> catch Numeric (1.123s), but it is a first step :)
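For readers who want to reproduce this style of measurement, here is a minimal timing harness in the same spirit, using only the standard library; a plain list of rows stands in for the numarray array (names and numbers are illustrative, so the absolute timings are not comparable to the ones quoted above, only the method is):

```python
# Minimal sketch of the benchmark's timing loop using only the stdlib.
# A list of [even, odd] pairs stands in for numarray.arange(2000)
# reshaped to (1000, 2); only the per-row indexing pattern is the same.
import time

def bench(n_rows=1000, n_reps=1000):
    a = [[2 * i, 2 * i + 1] for i in range(n_rows)]
    t0 = time.perf_counter()
    for _ in range(n_reps):
        for i in range(len(a)):
            row = a[i]          # per-row indexing, as in the post
    return time.perf_counter() - t0

elapsed = bench()
print("elapsed: %.3fs" % elapsed)
```

The same harness, pointed at the real array type, is what produces the 11.254s / 7.816s / 5.332s figures quoted in the thread.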
>
> Here is the patch. Please revise it, as I'm not very used to dealing
> with pure C extensions (I'm just a Pyrex user):
>
> Index: Lib/numarraycore.py
> ===================================================================
> RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
> retrieving revision 1.101
> diff -r1.101 numarraycore.py
> 696,699c696,699
> <     def __del__(self):
> <         if self._shadows != None:
> <             self._shadows._copyFrom(self)
> <             self._shadows = None
> ---
> >     def __del__(self):
> >         if self._shadows != None:
> >             self._shadows._copyFrom(self)
> >             self._shadows = None
> Index: Src/_numarraymodule.c
> ===================================================================
> RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
> retrieving revision 1.65
> diff -r1.65 _numarraymodule.c
> 399a400,411
> > static void
> > _numarray_dealloc(PyObject *self)
> > {
> >     PyArrayObject *selfa = (PyArrayObject *) self;
> >
> >     if (selfa->_shadows != NULL) {
> >         _copyFrom(selfa->_shadows, self);
> >         selfa->_shadows = NULL;
> >     }
> >     self->ob_type->tp_free(self);
> > }
> >
> 421c433
> <     0,                              /* tp_dealloc */
> ---
> >     _numarray_dealloc,              /* tp_dealloc */
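The reason pushing __del__ into tp_dealloc pays off is that a Python-level __del__ re-enters the interpreter on every object destruction, while a C-level dealloc does not. A rough, generic micro-benchmark of that per-object callback cost (an illustration only, not numarray code):

```python
# Compare destruction cost of objects with and without a Python __del__.
# Each cls() here is created and immediately destroyed, so the loop
# mostly measures allocation + deallocation overhead.
import time

class NoDel:
    pass

class WithDel:
    def __del__(self):
        pass               # even an empty __del__ forces a callback

def churn(cls, n=200000):
    t0 = time.perf_counter()
    for _ in range(n):
        cls()
    return time.perf_counter() - t0

t_plain = churn(NoDel)
t_del = churn(WithDel)
print("no __del__: %.3fs   with __del__: %.3fs" % (t_plain, t_del))
```

On most builds the `WithDel` loop is noticeably slower, which is the overhead the C-level `_numarray_dealloc` above removes.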
>
>
> The profile with the new optimizations looks now like:
>
> samples % image name symbol name
> 453 8.6319 python PyEval_EvalFrame
> 372 7.0884 python lookdict_string
> 349 6.6502 python string_hash
> 271 5.1639 libc-2.3.2.so _wordcopy_bwd_aligned
> 210 4.0015 libnumarray.so NA_updateStatus
> 194 3.6966 python _PyString_Eq
> 185 3.5252 libc-2.3.2.so __GI___strcasecmp
> 162 3.0869 python subtype_dealloc
> 158 3.0107 libc-2.3.2.so _int_malloc
> 147 2.8011 libnumarray.so isBufferWriteable
> 145 2.7630 python PyDict_SetItem
> 135 2.5724 _ndarray.so _view
> 131 2.4962 python PyObject_GenericGetAttr
> 122 2.3247 python PyDict_GetItem
> 100 1.9055 python PyString_InternInPlace
> 94 1.7912 libnumarray.so getReadBufferDataPtr
> 77 1.4672 _ndarray.so _simpleIndexingCore
>
> i.e. time spent in libc and libnumarray is moving up the list, as it
> should. Now we have to concentrate on other points of optimization.
> Perhaps it is a good time to try recompiling the kernel and
> getting the call tree...
>
> Cheers,
>
> A Divendres 28 Gener 2005 12:48, Todd Miller va escriure:
> > I got some insight into what I think is the tall pole in the profile:
> > sub-array creation is implemented using views. The generic indexing
> > code does a view() Python callback because object arrays override
> > view(). Faster view() creation for numerical arrays can be achieved
> > like this by avoiding the callback:
> >
> > Index: Src/_ndarraymodule.c
> > ===================================================================
> > RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
> > retrieving revision 1.75
> > diff -c -r1.75 _ndarraymodule.c
> > *** Src/_ndarraymodule.c 14 Jan 2005 14:13:22 -0000 1.75
> > --- Src/_ndarraymodule.c 28 Jan 2005 11:15:50 -0000
> > ***************
> > *** 453,460 ****
> >           }
> >       } else {  /* partially subscripted --> subarray */
> >           long i;
> > !         result = (PyArrayObject *)
> > !             PyObject_CallMethod((PyObject *) self, "view", NULL);
> >           if (!result) goto _exit;
> >
> >           result->nd = result->nstrides = self->nd - nindices;
> > --- 453,463 ----
> >           }
> >       } else {  /* partially subscripted --> subarray */
> >           long i;
> > !         if (NA_NumArrayCheck((PyObject *)self))
> > !             result = _view(self);
> > !         else
> > !             result = (PyArrayObject *) PyObject_CallMethod(
> > !                 (PyObject *) self, "view", NULL);
> >           if (!result) goto _exit;
> >
> >           result->nd = result->nstrides = self->nd - nindices;
> >
> > I committed the patch above to CVS for now. This optimization makes
> > view() "non-overridable" for NumArray subclasses, so there is probably
> > a better way of doing this.
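The trade-off described here can be sketched in plain Python (the class and method names below are illustrative, not numarray's actual code): a fast path that calls the base implementation directly is cheaper but silently bypasses any subclass override of view().

```python
# Sketch: fast-path dispatch vs. overridable-method dispatch.
class Base:
    def view(self):
        return "base view"

    def _fast_view(self):
        # What a C-level shortcut effectively does: call the base
        # implementation directly, skipping the method lookup.
        return Base.view(self)

    def index_slow(self):
        return self.view()        # honors subclass overrides (callback)

    def index_fast(self):
        return self._fast_view()  # faster, but not overridable

class Sub(Base):
    def view(self):
        return "custom view"

s = Sub()
print(s.index_slow())   # override respected
print(s.index_fast())   # override bypassed
```

The slow path's flexibility is exactly what the committed patch gives up for NumArray itself, which is why a subclass-aware check (like the NA_NumArrayCheck guard) is needed.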
> >
> > One other thing that struck me looking at your profile, and it has
> > been discussed before, is that NumArray.__del__() needs to be pushed
> > (back) down into C. Getting rid of __del__ would also synergize well
> > with making an object freelist, one aspect of which is capturing
> > unneeded objects rather than destroying them.
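A freelist in this sense could look roughly like the following generic sketch (not numarray's actual design; all names here are hypothetical): released objects are parked on a list and handed back out, so the allocator and deallocator are skipped entirely on the hot path.

```python
# Generic object-freelist pattern: recycle instead of destroy.
class ArrayLike:
    def __init__(self):
        self.data = None

_freelist = []

def acquire():
    # Reuse a parked object if one exists, otherwise allocate.
    return _freelist.pop() if _freelist else ArrayLike()

def release(obj):
    obj.data = None          # reset state before recycling
    _freelist.append(obj)    # park it instead of letting it be freed

a = acquire()
release(a)
b = acquire()
print(a is b)                # the recycled object was handed back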
> >
> > Thanks for the profile.
> >
> > Regards,
> > Todd
> >
> > On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
> > > Hi,
> > >
> > > After waiting a while for some free time, I'm now playing with the
> > > excellent oprofile, trying to help reduce numarray creation time.
> > >
> > > For that goal, I selected this small benchmark:
> > >
> > > import numarray
> > > a = numarray.arange(2000)
> > > a.shape = (1000, 2)
> > > for j in xrange(1000):
> > >     for i in range(len(a)):
> > >         row = a[i]
> > >
> > > I know that it mixes creation cost with indexing cost, but as
> > > numarray's indexing is only a bit slower (perhaps 40%) than
> > > Numeric's, while array creation is 5 to 10 times slower, I think
> > > this benchmark provides a good starting point to see what's going
> > > on.
> > >
> > > For numarray, I've got the next results:
> > >
> > > samples % image name symbol name
> > > 902 7.3238 python PyEval_EvalFrame
> > > 835 6.7798 python lookdict_string
> > > 408 3.3128 python PyObject_GenericGetAttr
> > > 384 3.1179 python PyDict_GetItem
> > > 383 3.1098 libc-2.3.2.so memcpy
> > > 358 2.9068 libpthread-0.10.so __pthread_alt_unlock
> > > 293 2.3790 python _PyString_Eq
> > > 273 2.2166 libnumarray.so NA_updateStatus
> > > 273 2.2166 python PyType_IsSubtype
> > > 271 2.2004 python countformat
> > > 252 2.0461 libc-2.3.2.so memset
> > > 249 2.0218 python string_hash
> > > 248 2.0136 _ndarray.so _universalIndexing
> > >
> > > while for Numeric I've got this:
> > >
> > > samples % image name symbol name
> > > 279 15.6478 libpthread-0.10.so __pthread_alt_unlock
> > > 216 12.1144 libc-2.3.2.so memmove
> > > 187 10.4879 python lookdict_string
> > > 162 9.0858 python PyEval_EvalFrame
> > > 144 8.0763 libpthread-0.10.so __pthread_alt_lock
> > > 126 7.0667 libpthread-0.10.so __pthread_alt_trylock
> > > 56 3.1408 python PyDict_SetItem
> > > 53 2.9725 libpthread-0.10.so __GI___pthread_mutex_unlock
> > > 45 2.5238 _numpy.so PyArray_FromDimsAndDataAndDescr
> > > 39 2.1873 libc-2.3.2.so __malloc
> > > 36 2.0191 libc-2.3.2.so __cfree
> > >
> > > one preliminary result is that numarray spends a lot more time in
> > > Python space than Numeric does, as Todd already said here. The
> > > problem is that, as I have not yet patched my kernel, I can't get
> > > the call tree, so I can't track down what is ultimately responsible
> > > for that.
> > >
> > > So, I've tried running the profile module included in the standard
> > > library to see which are the hot spots in Python:
> > >
> > > $ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
> > > 1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
> > >
> > > Ordered by: internal time
> > >
> > > ncalls  tottime percall cumtime percall filename:lineno(function)
> > >      1  19.220  19.220  25.290  25.290 create-numarray.py:1(?)
> > > 999999   5.530   0.000   5.530   0.000 numarraycore.py:514(__del__)
> > >   1753   0.160   0.000   0.160   0.000 :0(eval)
> > >      1   0.060   0.060   0.340   0.340 numarraycore.py:3(?)
> > >      1   0.050   0.050   0.390   0.390 generic.py:8(?)
> > >      1   0.040   0.040   0.490   0.490 numarrayall.py:1(?)
> > >   3455   0.040   0.000   0.040   0.000 :0(len)
> > >      1   0.030   0.030   0.190   0.190 ufunc.py:1504(_makeCUFuncDict)
> > >     51   0.030   0.001   0.070   0.001 ufunc.py:184(_nIOArgs)
> > >   3572   0.030   0.000   0.030   0.000 :0(has_key)
> > >   2582   0.020   0.000   0.020   0.000 :0(append)
> > >   1000   0.020   0.000   0.020   0.000 :0(range)
> > >      1   0.010   0.010   0.010   0.010 generic.py:510(_stridesFromShape)
> > >   42/1   0.010   0.000  25.290  25.290 <string>:1(?)
> > >
> > > but, to tell the truth, I can't really see where the time is
> > > actually consumed. Perhaps somebody with more experience can shed
> > > more light on this?
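One way to dig further, as asked here, is to sort the profile by cumulative time so that callers appear above their callees, which makes it clearer who is responsible for a hot leaf like __del__. A small self-contained sketch using the modern cProfile/pstats pair (the post used the older profile module; the workload below is a stand-in for the benchmark):

```python
# Profile a nested loop and report the top entries by cumulative time.
import cProfile
import io
import pstats

def inner():
    rows = [[i, i + 1] for i in range(1000)]
    for i in range(len(rows)):
        row = rows[i]

def outer():
    for _ in range(100):
        inner()

prof = cProfile.Profile()
prof.runcall(outer)

buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf).sort_stats("cumulative")
stats.print_stats(5)          # top 5 entries; callers rank above callees

summary = next(line for line in buf.getvalue().splitlines() if line.strip())
print(summary)
```

Sorting by "cumulative" (rather than `-s time` as in the post) attributes the __del__ cost upward to the code creating the temporaries, which is a poor man's substitute for the kernel-assisted call tree mentioned above.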
> > >
> > > Another thing that I find intriguing has to do with the Numeric
> > > oprofile output. Let me recall it:
> > >
> > > samples % image name symbol name
> > > 279 15.6478 libpthread-0.10.so __pthread_alt_unlock
> > > 216 12.1144 libc-2.3.2.so memmove
> > > 187 10.4879 python lookdict_string
> > > 162 9.0858 python PyEval_EvalFrame
> > > 144 8.0763 libpthread-0.10.so __pthread_alt_lock
> > > 126 7.0667 libpthread-0.10.so __pthread_alt_trylock
> > > 56 3.1408 python PyDict_SetItem
> > > 53 2.9725 libpthread-0.10.so __GI___pthread_mutex_unlock
> > > 45 2.5238 _numpy.so PyArray_FromDimsAndDataAndDescr
> > > 39 2.1873 libc-2.3.2.so __malloc
> > > 36 2.0191 libc-2.3.2.so __cfree
> > >
> > > we can see that a lot of the time in the Numeric benchmark is spent
> > > in libc space (37% or so). However, only about 16% goes to
> > > memory-related tasks (memmove, malloc and free), while the rest
> > > seems to go to threading issues (??). Again, can anyone explain why
> > > the pthread* routines take so much time, or why they appear here at
> > > all? Perhaps getting rid of these calls might improve Numeric's
> > > performance even further.
> > >
> > > Cheers,
> >
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/numpy-discussion