[SciPy-dev] Inclusion of cython code in scipy

Thu Apr 24 07:42:14 EDT 2008

On Thursday 24 April 2008, Stéfan van der Walt wrote:
> 2008/4/24 Prabhu Ramachandran <prabhu at aero.iitb.ac.in>:
> >  Let's take a simple case of someone wanting to handle a growing
> >  collection of, say, a million particles and do something to them.
> > How do you do that in cython/pyrex and get the performance of C and
> > interface to numpy?  Worse, even if it were possible, you'll still
> > need to know something about allocating memory in C and
> > manipulating pointers.  I can do that with C++ and SWIG today.
>
> That's the point: you, being a well-established programmer can do it
> easily, but most Python programmers would struggle doing that through
> some C or C++ API.  I think this would be pretty easy to do in
> Cython:
>
> 1. Write a function, say create_workspace(nr_elements), that creates
> a new ndarray and returns it:
>
>     cdef ndarray results_arr = np.empty((nr_elements,), dtype=np.double)
>
> 2. Grab a pointer to the memory (this should become a lot easier
> after GSOC 2008):
>
>     cdef double* results = <double*>results_arr.data
>
> 3. Run your loop in which you produce data points.  The moment you
> have more results than the output array can hold, call
> create_workspace(current_size**2), and use normal numpy indexing to
> copy the old results to the new location:
>
>     new_results_arr[:current_size] = old_results_arr
>
> 4. Rinse and repeat
>
> The beauty of the Cython approach is that you
>
> a) Never have to worry about INCREF and DECREF
>
> b) Can use Python calls within C functions.  You don't want to do
> that in your fast inner loop, but take the example above: we only
> copy arrays infrequently, and then we'd like to have the full power
> of numpy indexing.  Suddenly, sorting, averaging, summing becomes a
> one-liner, just like in Python, at the expense of one Python call
> (and this won't affect execution time in the above example).
>
> c) Debug in a much cleaner way than C++ or C code: fewer memory
> leaks, introspection of source etc.
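
As a rough sketch of the growth steps quoted above, here is how they
might look in plain Python with NumPy (not Cython, so without the raw
pointer access of step 2).  The name create_workspace comes from the
quoted message; collect, initial_size and the doubling growth factor
are assumptions made purely for illustration.

```python
import numpy as np


def create_workspace(nr_elements):
    """Step 1: allocate an uninitialised 1-D array of doubles."""
    return np.empty((nr_elements,), dtype=np.double)


def collect(values, initial_size=4):
    """Steps 3 and 4: store values in a workspace that grows on demand.

    `collect` and `initial_size` are hypothetical names, not from the thread.
    """
    results_arr = create_workspace(initial_size)
    count = 0
    for value in values:
        if count == results_arr.shape[0]:
            # Workspace is full: allocate a larger one and copy the old
            # results across with ordinary numpy slicing.  Doubling the
            # size is an assumption made for this sketch.
            new_results_arr = create_workspace(2 * results_arr.shape[0])
            new_results_arr[:count] = results_arr
            results_arr = new_results_arr
        results_arr[count] = value
        count += 1
    # Return only the filled part of the workspace.
    return results_arr[:count]


particles = collect(float(i) for i in range(10))
```

The copy happens only when the workspace fills up, so the cost of the
slicing call is spread over many insertions.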

Stéfan has made excellent points about Pyrex/Cython.  Let me just add
that once you have a large library of extensions, you can also avoid
the cost of Python calls when one extension method needs to call
another extension method.

For example, when I know that a method is going to be public, I
usually declare two versions: one that is callable directly from
another extension (i.e. without the Python call cost) and another that
is callable from Python.  So, in the code:

  def getitem(self, long nslot, ndarray nparr, long start):
    self.getitem_(nslot, nparr.data, start)

  cdef getitem_(self, long nslot, void *data, long start):
    cdef void *cachedata
    cachedata = self.getitem1_(nslot)
    # Copy the data in cache to destination
    memcpy(<char *>data + start * self.itemsize, cachedata,
           self.slotsize * self.itemsize)

calling MyClass.getitem_() from another extension saves you the
Python call.  This does not matter in most cases, but it can
certainly matter in others.

My two cents,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"