[Numpy-discussion] Re: Re-implementation of Python Numerical arrays (Numeric) available for download

Chris Barker chrishbarker at home.net
Tue Nov 20 15:23:12 EST 2001


Perry Greenfield wrote:

> > One major comment that isn't directly addressed on the web page is the
> > ease of writing new functions, I suppose Ufuncs, although I don't
> > usually care if they work on anything other than Arrays. I hope the new
> > system will make it easier to write new ones. 
<snip>

> Absolutely. We will provide examples of how to write new ufuncs. It should
> be very simple in one sense (requiring few lines of code) if our code
> generator machinery is used (but context is important here so this
> is why examples or a template is extremely important). But it isn't
> particularly hard to do without the code generator. And such ufuncs
> will handle *all* the generality of arrays including slices, non-aligned
> arrays, byteswapped arrays, and type conversion. I'd like to provide
> examples of writing ufuncs within a few weeks (along with examples
> of other kinds of functions using the C-API as well).

This sounds great! The code-generating machinery sounds very promising,
and examples are, of course, key. I found digging through the NumPy
source to figure out how to do things very treacherous. Making Ufuncs
easy to write will encourage a lot more C Ufuncs to be written, which
should help performance.

> > Also, I can't help wondering if this could leverage more existing code.
> > The blitz++ package being used by Eric Jones in the SciPy.compiler
> > project looks very promising. It's probably too late, but I'm wondering
> > what the reasons are for re-inventing such a general purpose wheel.
> >
> I'm not sure which "wheel" you are talking about :-)

The wheel I'm talking about are multi-dimensional array objects...

> We certainly
> aren't trying to replicate what Eric Jones has done with the
> SciPy.compiler approach (which is very interesting in its own right).

I know, I just think using an existing set of C++ classes for
multidimensional arrays of multiple types would make sense, although I
imagine it is too late now!

> If the issue is why we are redoing Numeric:

Actually, I think I had a pretty good idea why you were working on this.

> 1) it has to be rewritten to be acceptable to Guido before it can be
>    part of the Standard Library.
> 2) to add new types (e.g. unsigned) and representations (e.g., non-aligned,
>    byteswapped, odd strides, etc). Using memory mapped data requires some
>    of these.
> 3) to make it more memory efficient with large arrays.
> 4) to make it more generally extensible

I'm particularly excited about 1) and 4).

> > As a whole I have found that I would like the transition from Python to
> > compiled languages to be smoother. The standard answer to Python
> > performance is to profile, and then re-write the computationally intensive
> > portions in C. This would be a whole lot easier if Python used datatypes
> > that are easy to use from C/C++ as well as Python. I hope NumPy2 can
> > move in this direction.
> >
> What do you see as missing in numarray in that sense? Aside from UInt32
> I'm not aware of any missing type that is available on all platforms.
> There is the issue of Float128 and such. Adding these is not hard.
> The real issue is how to deal with the platforms that don't support them.

I used poor wording. When I wrote "datatypes", I meant data types in a
much higher-order sense. Perhaps structures or classes would be a better
term. What I mean is that it should be easy to use and manipulate the
same multidimensional arrays from both Python and C/C++. In the current
Numeric, most folks generate a contiguous array, and then just use the
array->data pointer to get what is essentially a C array. That's fine if
you are using it in a traditional C way, with fixed dimensions, one
datatype, etc. What I'm imagining is having an object in C or C++ that
could easily be used as a multidimensional array. I'm thinking C++ would
probably be necessary, and probably templates as well, which is why blitz++
looked promising. Of course, blitz++ only compiles with a few up-to-date
compilers, so you'd never get it into the standard library that way!
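
For what it's worth, here is a minimal sketch of the Python side of that
handoff as I do it today with the current Numeric (the C routine it would
feed is hypothetical, so the sketch just returns the prepared array):

    import Numeric

    def prepare_for_c(a):
        # The C side typically walks array->data as a flat C array of
        # doubles, so hand it a contiguous, double-precision array.
        if not a.iscontiguous() or a.typecode() != Numeric.Float:
            a = Numeric.array(a, Numeric.Float)   # forces a contiguous copy
        # ...here you would call the (hypothetical) C extension on 'a'...
        return a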

This could also lead the way to being able to compile NumPy code....<end
fantasy>

> I think it is pretty easy to install since it uses distutils.

I agree, but from the newsgroup, it is clear that a lot of folks are
very reluctant to use something that is not part of the standard
library.

> > >    We estimate
> > >    that numarray is probably another order of magnitude worse,
> > >    i.e., that 20K element arrays are at half the asymptotic
> > >    speed. How much should this be improved?
> >
> > A lot. I use arrays smaller than that most of the time!
> >
> What is good enough? As fast as current Numeric?

As fast as current Numeric would be "good enough" for me. It would be a
shame to go backwards in performance!

> (IDL does much
> better than that for example).

My personal benchmark is MATLAB, which I imagine is similar to IDL in
performance.

> 10 element arrays will never be
> close to C speed in any array based language embedded in an
> interpreted environment.

Well, sure, I'm not expecting that.

> 100, maybe, but will be very hard.
> 1000 should be possible with some work.

I suppose MATLAB has it easier, as all arrays are doubles and (until
recently anyway) all variables were arrays, and all arrays were 2-d.
NumPy is a lot more flexible than that. Is it the type and size checking
that takes the time?
 
> Another approach is to try to cast many of the functions as being
> able to broadcast over repeated small arrays. After all, if one
> is only doing a computation on one small array, it seems unlikely
> that the overhead of Python will be objectionable. Only if you
> have many such arrays to repeat calculations on, should it be
> a problem (or am I wrong about that).

You are probably right about that.

> If these repeated calculations
> can be "assembled"  into a higher dimensionality array (which
> I understand isn't always possible) and operated on in that sense,
> the efficiency issue can be dealt with.

I do that when possible, but it's not always possible.
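
For example, the difference between looping over many small arrays and
assembling them first looks something like this (just a sketch against the
current Numeric API, with made-up data):

    import Numeric

    offsets = [Numeric.array([1.0, 2.0]),
               Numeric.array([3.0, 4.0]),
               Numeric.array([5.0, 6.0])]

    # Slow: one Python-level operation per small array.
    scaled_slow = [2.0 * off for off in offsets]

    # Faster: assemble into one higher-dimensional array and operate once.
    stacked = Numeric.array(offsets)   # shape (3, 2)
    scaled_fast = 2.0 * stacked        # a single vectorized multiply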

> But I guess this can only
> be seen with specific existing examples and programs. I would
> be interested in seeing the kinds of applications you have now
> to gauge what the most effective solution would be.

One of the things I do a lot with are coordinates of points and
polygons. Sets of points I can handle easily as an Nx2 array, but
polygons don't work so well, as each polygon has a different number of
points, so I use a list of arrays, which I have to loop over. Each
polygon can have from about 10 to thousands of points (mostly 10-20,
however). One way I have dealt with this is to store a polygon set as a
large array of all the points, and another array with the indices of the
start and end of each polygon. That way I can transform the coordinates
of all the polygons in one operation. It works OK, but sometimes it is
more useful to have them in a sequence.
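
To make that concrete, here is roughly what that storage scheme looks like
(a sketch with made-up data, using the current Numeric; the variable names
are just illustrative):

    import Numeric

    # Three polygons with 4, 3, and 5 vertices, stored end to end.
    points = Numeric.zeros((12, 2), Numeric.Float)  # all vertices, shape (total, 2)
    starts = Numeric.array([0, 4, 7])               # index where each polygon begins
    ends   = Numeric.array([4, 7, 12])              # one past each polygon's last vertex

    # Transform every polygon's coordinates in one operation,
    # e.g. a shift followed by a uniform scale.
    transformed = (points + Numeric.array([10.0, 20.0])) * 0.5

    # Recover the individual polygons as slices when a sequence is needed.
    polygons = [transformed[starts[i]:ends[i]] for i in range(len(starts))]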

> As mentioned,
> we tend to deal with large data sets and so I don't think we have
> a lot of such examples ourselves.

I know large datasets were one of your driving factors, but I really
don't want performance on smaller datasets to become a secondary concern.

I hope I'll get a chance to play with it soon....

-Chris


-- 
Christopher Barker, Ph.D.
ChrisHBarker at home.net                 ---           ---           ---
http://members.home.net/barkerlohmann ---@@       -----@@       -----@@
                                   ------@@@     ------@@@     ------@@@
Oil Spill Modeling                ------   @    ------   @   ------   @
Water Resources Engineering       -------      ---------     --------    
Coastal and Fluvial Hydrodynamics --------------------------------------
------------------------------------------------------------------------



