[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
Lluís
xscript at gmx.net
Thu Jul 8 13:38:08 EDT 2010
Skipper Seabold writes:
> On Thu, Jul 8, 2010 at 12:02 PM, Rob Speer <rspeer at mit.edu> wrote:
[...]
>> My proposal is that datarray.row should be equivalent to
>> datarray.axes[0], and datarray.column should be equivalent to
>> datarray.axes[1], so that you can always ask for something like
>> "arr.column.named(2010)" (replace those with square brackets if you
>> like).
>>
>> Not sure yet what the right way is to generalize this to 1-D and n-D.
> I think we have to start from the nD case, even if I (and I think most
> users) will tend to think in 2D. The rest is just going to have to be
> up to developers how they want users to interact with what we, the
> developers, see as axes. No end-user wants to think about the 6th
> axis of the data, but I don't want to be pegged into rows and columns
> thinking because I don't think it works for the below example.
You could simply provide a subclass of datarray called 'table' that
automatically labels the two (mandatory) axes as 'column' and 'row'.
[...]
> city, month, year, region, precipitation, temperature
> "Austin", "January", 1980, "South", 12.1, 65.4,
> "Austin", "February", 1980, "South", 24.3, 55.4
> "Austin", "March", 1980, "South", 3, 69.1
> ....
> "Austin", "December", 2009, 1, 62.1
> "Boston", "January", 1980, "Northeast", 1.5, 19.2
> ....
> "Boston","December", 2009, "Northeast", 2.1, 23.5
> ...
> "Memphis","January",1980, "South", 2.1, 35.6
> ...
> "Memphis","December",2009, "South", 1.2, 33.5
> ...
> Sometimes, I want, say, to know what the average temperature is in
> December. Sometimes I want to know what the average temperature is in
> Memphis. Sometimes I want to know the average temperature in Memphis
> in December or in Memphis in 1985. If I do this with structured
> arrays, most group-by type operations are at best O(n). Really this
> isn't feasible.
If I understood correctly, you could have four axes (assuming that an Axis can
only handle a single label/variable).
a = DatArray(numpy.array([...], dtype = [("precipitation", float),
                                         ("temperature", float)]),
             (("city", ["Austin", ...]),
              ("month", ["January", ...]),
              ...))
Then, you can:
a.city.named("Memphis").month.named("December")["temperature"].mean()
a.city.named("Memphis").year.named(1985)["temperature"].mean()
Or shorter:
a.named["Memphis","December"]["temperature"].mean()
a.named["Memphis",:,1985]["temperature"].mean()
This raises the problem of non-homogeneous measurements. For example, if you had
only a few measurements for Austin, the remaining cells would have to be filled
with NaNs to keep the shape homogeneous.
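To make the dense layout and its NaN padding concrete, here is a minimal sketch in plain NumPy. It is not the datarray API: the `labels` dict and the `index_of` helper are hypothetical names used only for illustration, and the values are the ones from the sample rows quoted above.

```python
import numpy as np

# Axis labels, in axis order (hypothetical stand-in for named axes).
labels = {
    "city": ["Austin", "Boston", "Memphis"],
    "month": ["January", "December"],
    "year": [1980, 2009],
}
axis_names = list(labels)

# Dense temperature cube; cells without a measurement stay NaN.
temperature = np.full([len(v) for v in labels.values()], np.nan)

def index_of(**coords):
    """Translate axis labels into a positional index tuple."""
    return tuple(
        labels[name].index(coords[name]) if name in coords else slice(None)
        for name in axis_names
    )

# Values taken from the sample rows quoted above.
temperature[index_of(city="Austin", month="January", year=1980)] = 65.4
temperature[index_of(city="Austin", month="December", year=2009)] = 62.1
temperature[index_of(city="Boston", month="January", year=1980)] = 19.2
temperature[index_of(city="Boston", month="December", year=2009)] = 23.5
temperature[index_of(city="Memphis", month="January", year=1980)] = 35.6
temperature[index_of(city="Memphis", month="December", year=2009)] = 33.5

# Mean over everything recorded for Memphis in December, skipping the padding.
memphis_december = np.nanmean(
    temperature[index_of(city="Memphis", month="December")]
)
# Mean December temperature across all cities and years.
december = np.nanmean(temperature[index_of(month="December")])
```

Half of the cells in this small cube are NaN, which is the "wasted" space that a dense, homogeneous shape costs when measurements are sparse.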
I solved this in sciexp2 with (this is not the API, but translated into a
DatArray-like interface for clarity):
a = Data(numpy.array([...], dtype = [("precipitation", float),
                                     ("temperature", float)]),
         (("measurement", "@city@-@month@-@year@-@region@",
           [{"city": "Austin", "month": "January", "year": 1980, "region": "South"},
            ...])))
a.named[::"city == 'Memphis' && month == 'December'"]["temperature"].mean()
a.named[::"city == 'Memphis' && year == 1985"]["temperature"].mean()
But of course, this represents a tradeoff between "wasted" space and speed. The
internals are along the lines of (using ordered dicts):
{ 'city'  : { 'Memphis':  set(<indexes with memphis>),
              ... },
  'month' : { 'December': set(<indexes with december>),
              ... },
  ... }
Since both conditions must hold, the first query translates into an intersection
of the index sets:
a[intersection( d['city']['Memphis'], d['month']['December'] )]
There's a less optimized path that supports arbitrary expressions (less than,
greater than or equal, etc.), but it has a cost of O(n).
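A rough, self-contained sketch of that kind of label index follows. It is not the sciexp2 implementation: `build_index` and `select` are hypothetical helpers, and the records reuse values from the sample rows quoted above. For an `&&` query, the per-label sets are intersected, so the lookup cost depends on the size of the sets involved rather than on scanning every record.

```python
from collections import defaultdict

# Flat list of records; values come from the sample rows quoted above.
records = [
    {"city": "Austin", "month": "January", "year": 1980, "temperature": 65.4},
    {"city": "Boston", "month": "December", "year": 2009, "temperature": 23.5},
    {"city": "Memphis", "month": "January", "year": 1980, "temperature": 35.6},
    {"city": "Memphis", "month": "December", "year": 2009, "temperature": 33.5},
]

def build_index(rows, variables):
    """Map variable -> label -> set of row positions holding that label."""
    index = {name: defaultdict(set) for name in variables}
    for position, row in enumerate(rows):
        for name in variables:
            index[name][row[name]].add(position)
    return index

def select(index, **equals):
    """Row positions matching every `variable == label` condition."""
    matches = [index[name].get(label, set()) for name, label in equals.items()]
    return sorted(set.intersection(*matches)) if matches else []

d = build_index(records, ["city", "month", "year"])

# Equivalent of: city == 'Memphis' && month == 'December'
rows = select(d, city="Memphis", month="December")
mean_temperature = sum(records[i]["temperature"] for i in rows) / len(rows)
```

Building the index is a one-time O(n) pass that stores one set entry per record and variable, which is the space side of the tradeoff.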
> An even more difficult question is what if I want descriptive
> statistics on the "region" variable? Ie., I want to know how many
> observations I have for each region. This one can wait, but is still
> important for doing statistics.
This _should_ be:
a.region.named("South").size
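For counts of every region at once, a sketch with plain NumPy (not the DatArray interface) could look like the following. The `region` array is an assumed flat column holding the region of each observation, filled here from the sample rows quoted above that include a region.

```python
import numpy as np

# Region of each observation, taken from the sample rows quoted above.
region = np.array(
    ["South", "South", "South", "Northeast", "Northeast", "South", "South"]
)

# np.unique returns the distinct labels (sorted) and how often each occurs.
names, counts = np.unique(region, return_counts=True)
observations_per_region = {str(n): int(c) for n, c in zip(names, counts)}
```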
Read you,
Lluis
--
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth