[Numpy-discussion] RE: default axis for numarray

eric jones eric at enthought.com
Tue Jun 11 10:45:01 EDT 2002


> "eric jones" <eric at enthought.com> writes:
> 
> > The issue here is both consistency across a library and speed.
> 
> Consistency, fine. But not just within one package, also between
> that package and the language it is implemented in.
> 
> Speed, no. If I need a sum along the first axis, I won't replace
> it by a sum across the last axis just because that is faster.

The default axis choice influences how people choose to lay out their
data in arrays.  If the default is to sum down columns, then users lay
out their data so that this is the order of computation.  For C-ordered
arrays, this means strided (cache-unfriendly) memory access.  There are
cases where you need to reduce over multiple data sets, etc., which is
what the axis=? flag is for.  But
choosing the default to also be the most efficient just makes sense.
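
To make the memory-layout point concrete, here is a rough sketch using
Numeric; the array shape and the repeat count are made up for
illustration.  With the usual C (row-major) layout, reducing over the
last axis walks contiguous memory, while reducing over the first axis
strides across rows:

    # Sketch: contiguous vs. strided reduction on a C-ordered array.
    # The shape and repeat count are arbitrary, for illustration only.
    import time
    import Numeric

    a = Numeric.ones((1000, 1000), Numeric.Float)

    t0 = time.time()
    for i in range(10):
        Numeric.add.reduce(a, -1)   # along rows: contiguous access
    t_last = time.time() - t0

    t0 = time.time()
    for i in range(10):
        Numeric.add.reduce(a, 0)    # down columns: strided access
    t_first = time.time() - t0

    print("axis=-1: %.3fs   axis=0: %.3fs" % (t_last, t_first))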

The cost is even higher for wrappers around C libraries not written
explicitly for Python (which is most of them), because you have to
re-order the memory before passing the variables into the C loop.  Of
course, axis=0 is faster for Fortran libraries with wrappers that are
smart enough to recognize this (Pearu's f2py-wrapped libraries now
recognize this sort of thing).  However, the marriage to C is more
important, as future growth will come in this area more than in Fortran.
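
(As a sketch of the kind of shuffling such a wrapper has to do --
`c_func' here stands in for a hypothetical wrapped C routine that
expects contiguous rows -- the data has to be made contiguous in the
order the C loop wants before the call:)

    import Numeric

    def call_rowwise_op(a, c_func):
        # c_func stands in for a wrapped C routine that expects
        # contiguous rows (last axis varying fastest).
        if not a.iscontiguous():
            a = Numeric.array(a)    # forces a contiguous C-layout copy
        return c_func(a)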

> 
> > From the numpy.pdf, Numeric looks to have about 16 functions using
> > axis=0 (or index=0 which should really be axis=0) and, counting FFT,
> > about 10 functions using axis=-1.  To this day, I can't remember which
> 
> If you weight by frequency of usage, the first group gains a lot in
> importance. I just scanned through some of my code; almost all of the
> calls to Numeric routines are to functions whose default axis
> is zero.

Right, but I think all the reduce operators (sum, product, etc.) should
have been axis=-1 in the first place.

> 
> > code.  Unfortunately, many of the Numeric functions that should still
> > don't take axis as a keyword, so you end up just inserting -1 in the
> 
> That is certainly something that should be fixed, and I suppose no one
> objects to that.

Sounds like Travis already did it.  Thanks.

> 
> 
> My vote is for keeping axis defaults as they are, both because the
> choices are reasonable (there was a long discussion about them in the
> early days of NumPy, and the defaults were chosen based on other array
> languages that had already been in use for years) and because any
> change would cause most existing NumPy code to break in many places,
> often giving wrong results instead of an error message.
> 
> If a uniformization of the default is desired, I vote for axis=0,
> for two reasons:
> 1) Consistency with Python usage.

I think the consistency with Python is less of an issue than it seems.
I wasn't aware that add.reduce(x) would generate the same results as
the Python version of reduce(add,x) until Perry pointed it out to me.
There are some inconsistencies between Python the language and Numeric
because of the needs of the Numeric community.  For instance, slices
create views instead of copies, unlike in Python.  This was a justified
break with consistency in a heavily used area of Python, made for the
sake of efficiency.
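
For reference, a small sketch of the two behaviors mentioned above
(add.reduce matching Python's reduce(add, x), and slices as views),
using Numeric and the Python builtins:

    # add.reduce over the default axis 0 matches reduce(add, x),
    # which adds the rows of x element-wise.
    import Numeric
    from operator import add

    x = Numeric.array([[1, 2], [3, 4]])
    print(Numeric.add.reduce(x))   # -> [4 6], summed down axis 0
    print(reduce(add, x))          # -> [4 6], same result

    # Slicing yields a view into x, not a copy as with Python lists.
    sub = x[0:1]                   # first row, as a 1x2 slice (a view)
    sub[0, 0] = 99
    print(x[0, 0])                 # -> 99, the original array changed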

I don't see choosing axis=-1 as a break with Python -- multi-dimensional
arrays are inherently different and used differently than lists of lists
in Python.  Further, reduce() is a "corner" of the Python language that
has been superseded by list comprehensions.  Choosing an alternative
behavior that is generally better for array operations, as in the case
of slices as views, is worth the change.
 
> 2) Minimization of code breakage.

Fixes will be necessary for sure, and I wish that weren't the case.  They
will be needed whichever consistent interface we choose.  Choosing axis=0
or axis=-1 will not change how much needs to be fixed -- only which
function names have to be searched for.

> 
> 
> > We should also strive to make it as easy as possible to write generic
> > functions that work for all array types (Int, Float, Float32, Complex,
> > etc.) -- yet another debate to come.
> 
> What needs to be improved in that area?

Comparisons of complex numbers.  But let's save that debate for later.

> 
> > Changes are going to create some backward incompatibilities and that is
> > definitely a bummer.  But some changes are also necessary before the
> > community gets big.  I know the community is already a reasonable size,
> 
> I'd like to see evidence that changing the current NumPy behaviour
> would increase the size of the community. It would first of all split
> the current community, because many users (like myself) do not have
> enough time to spare to go through their code line by line in order to
> check for incompatibilities. That many others would switch to Python
> if only some changes were made is merely an hypothesis.

True.  But I can tell you that we're definitely doing something wrong
now.  We have a superior language that is easier to integrate with
legacy code and less expensive than the best competing alternatives.
And, though I haven't done a serious market survey, I feel safe in
saying we have significantly less than 1% of the potential user base.
Even in communities where Python is relatively prevalent like astronomy,
I would bet that the everyday user base is less than 5% of the whole.  There
are a lot of holes to fill (graphics, comprehensive libraries, etc.)
before we get up to the capabilities and quality of user interface that
these tools have.  Some of the interfaces problems are GUI and debugger
related.  Others are API related.  Inconsistency in a library interface
makes it harder to learn and is a wart.  Whether it is as important as a
graphics library?  Probably not.  But while we're building the next
generation tool, we should fix things that make people wonder "why did
they do this?".  It is rarely a single thing that makes all the
difference to a prospective user switching over.  It is the overall
quality of the tool that will sway them.

> 
> > > Some feel that is contrary to expectations that the least rapidly
> > > varying dimension should be operated on by default. There are
> > > good arguments for both sides. For example, Konrad Hinsen has
> 
> Actually the argument is not for the least rapidly varying
> dimension, but for the first dimension. The internal data layout
> is not significant for most Python array operations. We might
> for example offer a choice of C style and Fortran style data layout,
> enabling users to choose according to speed, compatibility, or
> just personal preference.

In a way, as Pearu has shown in f2py, this is already possible by
jiggering the stride and dimension entries, so this doesn't even require
a change to the array descriptor (I don't think...).  We could supply
functions that return a Fortran-layout array.  This would be
beneficial for some applications outside of what we're discussing now
that use Fortran extensions heavily.  As long as it is transparent to
the extension writer (which I think it can be), it sounds fine.
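
A minimal sketch of the stride-jiggering I mean, using Numeric's
transpose (which permutes the stride and dimension entries instead of
copying the data):

    # Numeric.transpose swaps strides/dimensions without moving data,
    # so the result behaves like a Fortran-layout view of the buffer.
    import Numeric

    a = Numeric.zeros((3, 4), Numeric.Float)   # C layout
    ft = Numeric.transpose(a)                  # shape (4, 3), no copy

    ft[0, 0] = 1.0
    print(a[0, 0])                             # -> 1.0, same buffer
    print(a.iscontiguous())                    # -> 1
    print(ft.iscontiguous())                   # -> 0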

I think the default constructor should return a C layout array, though;
that is what 99% of users will use.

eric


> 
> Konrad.
> --
> -----------------------------------------------------------------------------
> Konrad Hinsen                            | E-Mail: hinsen at cnrs-orleans.fr
> Centre de Biophysique Moleculaire (CNRS) | Tel.: +33-2.38.25.56.24
> Rue Charles Sadron                       | Fax:  +33-2.38.63.15.17
> 45071 Orleans Cedex 2                    | Deutsch/Esperanto/English/
> France                                   | Nederlands/Francais
> -----------------------------------------------------------------------------




