[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Lluís xscript at gmx.net
Thu Jul 8 13:38:08 EDT 2010


Skipper Seabold writes:

> On Thu, Jul 8, 2010 at 12:02 PM, Rob Speer <rspeer at mit.edu> wrote:
[...]
>> My proposal is that datarray.row should be equivalent to
>> datarray.axes[0], and datarray.column should be equivalent to
>> datarray.axes[1], so that you can always ask for something like
>> "arr.column.named(2010)" (replace those with square brackets if you
>> like).
>> 
>> Not sure yet what the right way is to generalize this to 1-D and n-D.

> I think we have to start from the nD case, even if I (and I think most
> users) will tend to think in 2D.  The rest is just going to have to be
> up to developers how they want users to interact with what we, the
> developers, see as axes.  No end-user wants to think about the 6th
> axis of the data, but I don't want to be pegged into rows and columns
> thinking because I don't think it works for the below example.

You could simply provide a subclass of datarray called 'table' that
automatically labels the two (mandatory) axes as 'column' and 'row'.


[...]
> city, month, year, region, precipitation, temperature
> "Austin", "January", 1980, "South", 12.1, 65.4,
> "Austin", "February", 1980, "South", 24.3, 55.4
> "Austin", "March", 1980, "South", 3, 69.1
> ....
> "Austin", "December", 2009, 1, 62.1
> "Boston", "January", 1980, "Northeast", 1.5, 19.2
> ....
> "Boston","December", 2009, "Northeast", 2.1, 23.5
> ...
> "Memphis","January",1980, "South", 2.1, 35.6
> ...
> "Memphis","December",2009, "South", 1.2, 33.5
> ...

> Sometimes, I want, say, to know what the average temperature is in
> December.  Sometimes I want to know what the average temperature is in
> Memphis.  Sometimes I want to know the average temperature in Memphis
> in December or in Memphis in 1985.  If I do this with structured
> arrays, most group-by type operations are at best O(n).  Really this
> isn't feasible.

If I understood correctly, you could have 4 axes (assuming that an Axis can
only handle a single label/variable).

a = DatArray(numpy.array([...], dtype = [("precipitation", float),
                                         ("temperature", float)]),
             (("city", ["Austin", ...]),
              ("month", ["January"]),
              ...))

Then, you can:
  a.city.named("Memphis").month.named("December")["temperature"].mean()
  a.city.named("Memphis").year.named(1985)["temperature"].mean()

Or shorter:
  a.named["Memphis","December"]["temperature"].mean()
  a.named["Memphis",:,1985]["temperature"].mean()

This raises the problem of non-homogeneous measurements. For example, if you had
only a few measurements for Austin, the remaining cells would have to be filled
with NaNs to keep the shape homogeneous.
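
As a rough illustration of the dense layout in plain NumPy (this is not the
datarray API; the `labels` mapping, the `select` helper and the temperature
values are all made up for the example), each label is mapped to a position on
its axis, and cells without a measurement stay NaN:

```python
import numpy as np

# Hypothetical axis labels; a real DatArray would carry these itself.
labels = {
    "city": ["Austin", "Boston", "Memphis"],
    "month": ["January", "December"],
    "year": [1980, 1985],
}
axis_order = ["city", "month", "year"]

# One dense array per measured variable; NaN marks "no observation".
shape = tuple(len(labels[name]) for name in axis_order)
temperature = np.full(shape, np.nan)


def select(**wanted):
    """Index tuple: an integer on each named axis, a full slice elsewhere."""
    return tuple(
        labels[name].index(wanted[name]) if name in wanted else slice(None)
        for name in axis_order
    )


# Illustrative values only.
temperature[select(city="Memphis", month="December", year=1980)] = 33.0
temperature[select(city="Memphis", month="December", year=1985)] = 35.0
temperature[select(city="Memphis", month="January", year=1985)] = 36.0
temperature[select(city="Austin", month="January", year=1980)] = 65.4

# NaN-aware means over the remaining (unnamed) axes.
memphis_december = np.nanmean(temperature[select(city="Memphis", month="December")])
memphis_1985 = np.nanmean(temperature[select(city="Memphis", year=1985)])

# Fraction of cells that actually hold a measurement.
fill_ratio = np.count_nonzero(~np.isnan(temperature)) / temperature.size
```

Here only 4 of the 12 cells are filled, which is the "wasted" space the dense
layout pays for in exchange for direct positional lookup.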

I solved this in sciexp2 with (this is not the API, but translated into a
DatArray-like interface for clarity):

  a = Data(numpy.array([...], dtype = [("precipitation", float),
                                       ("temperature", float)]),
             (("measurement", "@city@-@month@-@year@-@region@",
               [{"city": "Austin", "month": "January", "year": 1980, "region": "South"},
                ...])))

  a.named[::"city == 'Memphis' && month == 'December'"]["temperature"].mean()
  a.named[::"city == 'Memphis' && year == 1985"]["temperature"].mean()

But of course, this represents a tradeoff between "wasted" space and speed. The
internals are along the lines of (using ordered dicts):

  { 'city' : { 'Memphis': set(<indexes with memphis>),
               ... },
    'month' : { 'December': set(<indexes with december>),
                ... },
    ... }

Since the conditions are joined with &&, the first query above translates into
an intersection of the index sets:

  a[intersection( d['city']['Memphis'], d['month']['December'] )]

There's a less optimized path that supports arbitrary expressions (less than,
more than or equal, etc.), but has a cost of O(n).
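
To make the two paths concrete, here is a minimal pure-Python sketch. It is not
the sciexp2 code: `build_index`, `select_equal` and `select_where` are names
invented for this example, and the Memphis/December/1985 value is made up.

```python
from collections import defaultdict

# A flat list of measurements; only the rows that exist are stored.
records = [
    {"city": "Austin", "month": "January", "year": 1980, "temperature": 65.4},
    {"city": "Memphis", "month": "January", "year": 1980, "temperature": 35.6},
    {"city": "Memphis", "month": "December", "year": 1985, "temperature": 34.0},
    {"city": "Memphis", "month": "December", "year": 2009, "temperature": 33.5},
    {"city": "Boston", "month": "December", "year": 2009, "temperature": 23.5},
]


def build_index(rows, variables):
    """Map variable -> value -> set of row positions holding that value."""
    index = {name: defaultdict(set) for name in variables}
    for i, row in enumerate(rows):
        for name in variables:
            index[name][row[name]].add(i)
    return index


def select_equal(index, **conditions):
    """Fast path: equality tests joined by AND become a set intersection."""
    sets = [index[name].get(value, set()) for name, value in conditions.items()]
    return set.intersection(*sets) if sets else set()


def select_where(rows, predicate):
    """Slow path: an arbitrary predicate needs a full O(n) scan."""
    return {i for i, row in enumerate(rows) if predicate(row)}


def mean_temperature(rows, positions):
    values = [rows[i]["temperature"] for i in sorted(positions)]
    return sum(values) / len(values)


index = build_index(records, ["city", "month", "year"])
memphis_december = select_equal(index, city="Memphis", month="December")
after_2000 = select_where(records, lambda row: row["year"] > 2000)
```

The fast path only touches the sets for the requested values, while the slow
path visits every record once.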


> An even more difficult question is what if I want descriptive
> statistics on the "region" variable?  Ie., I want to know how many
> observations I have for each region.  This one can wait, but is still
> important for doing statistics.

This _should_ be:

  a.region.named("South").size
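
Outside of any DatArray-like interface, the same per-region count can be
sketched with the standard library (the `observations` list below is a made-up
stand-in for the city/region columns of the table above):

```python
from collections import Counter

# (city, region) pairs, one per observation; illustrative data only.
observations = [
    ("Austin", "South"),
    ("Austin", "South"),
    ("Boston", "Northeast"),
    ("Memphis", "South"),
]

# Number of observations per region.
counts = Counter(region for _city, region in observations)
```

With a set-based index like the one described earlier, the count for one region
would simply be the size of that region's index set.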


Read you,
     Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


