[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Lluís xscript at gmx.net
Tue Jul 6 14:51:20 EDT 2010


> My opinion on the matter is that, as a matter of "purity," labels
> should all have the string datatype. That said, I'd imagine that
> passing an int as an argument would be fine, due to python's
> loosey-goosey attitude towards datatypes. :) That, or, y'know,
> str(myint).

That's kind of what I went for in sciexp2. Integers index the structure
directly, and strings are internally translated into the real integers, or
lists of them (e.g., by a filter; see below).

All translation into the real integers happens in the Dimension object [1] (an
Axis in datarray), which supports all the indexing methods in numpy (slices,
iterables, etc.), plus what I call filters (i.e., slicing by tick values) [2].
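To make the idea concrete, here is a minimal, hypothetical sketch of that
translation step (not sciexp2's actual code; the class and method names are
made up): tick strings are mapped to the underlying integer positions, while
integers and slices pass through untouched.

```python
# Hypothetical sketch of label -> integer translation in a named axis.
# Not sciexp2's real implementation; names are illustrative only.

class MiniDimension:
    def __init__(self, ticks):
        self._ticks = list(ticks)
        self._index = {t: i for i, t in enumerate(self._ticks)}

    def translate(self, key):
        """Map a tick string (or list of them) to integer indices;
        pass integers and slices through unchanged."""
        if isinstance(key, str):
            return self._index[key]
        if isinstance(key, (list, tuple)):
            return [self.translate(k) for k in key]
        return key  # already a valid numpy index (int, slice, ...)

dim = MiniDimension(["1-z1", "1-z2", "2-z1"])
print(dim.translate("1-z2"))            # -> 1
print(dim.translate(["1-z1", "2-z1"]))  # -> [0, 2]
print(dim.translate(slice(0, 2)))       # -> slice(0, 2, None)
```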

If you download the code, you can see the documentation for the user API in a
nicer way with './sciexp2/trunk/plotter -d'.

After looking into [3], sciexp2 seems conceptually equivalent to datarray. The
main difference I see is that sciexp2 supports "compound" ticks, in the sense
that, for me, ticks are formed by a sequence of variables meaningful to the
user, which are merged into a single unique string following a user-provided
expression:

      Dimension.expression <- "@PARAM1@-@PARAM2@"
      Dimension.contents <- ["1-z1", "1-z2", "2-z1", "2-z5", ...]

So that the user is able not only to index through tick strings (e.g.,
data["1-z1"]), but also to arbitrarily slice the structure according to the
separate values of each variable (e.g., data[::"PARAM1 <= 3 && PARAM2 ==
'z6'"] or any other boolean expression involving either or both of PARAM1 and
PARAM2).
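The mechanics of such a filter can be sketched in plain Python (this is only an
illustration of the idea, not sciexp2's filter language or parser): split each
compound tick back into its variables using the expression's pattern, then keep
the integer indices whose variables satisfy a predicate.

```python
import re

# Illustrative only: ticks built from the expression "@PARAM1@-@PARAM2@"
# are split back into their variables, and a predicate selects indices.
ticks = ["1-z1", "1-z2", "2-z1", "2-z5", "4-z6"]
pattern = re.compile(r"(?P<PARAM1>\d+)-(?P<PARAM2>\w+)")

def select(ticks, predicate):
    """Return the integer indices whose tick variables satisfy `predicate`."""
    indices = []
    for i, tick in enumerate(ticks):
        vars_ = pattern.fullmatch(tick).groupdict()
        vars_["PARAM1"] = int(vars_["PARAM1"])  # allow numeric comparison
        if predicate(**vars_):
            indices.append(i)
    return indices

# roughly the effect of data[::"PARAM1 <= 3 && PARAM2 == 'z1'"]
print(select(ticks, lambda PARAM1, PARAM2: PARAM1 <= 3 and PARAM2 == "z1"))
# -> [0, 2]
```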

The other difference is that the Data object in sciexp2 also uses record
arrays (but not recarrays, as the documentation mentions their extra access
costs). The idea is that record fields contain the results of a single
experiment, and experiment parameters (one "variable" per experiment
parameter) are arbitrarily mapped onto axes/dimensions (thus, the "values" of
the experiment parameters form the ticks/indexes of those dimensions). This
allows the user to store heterogeneous results in a single 'Data' object
(e.g., mixing integers, floats, strings, dates, etc.).
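The underlying numpy mechanism looks roughly like this (a generic structured
array example, not sciexp2 code; field names and values are made up):
heterogeneous per-experiment results live in the record fields, and each field
can be pulled out or filtered as a plain ndarray.

```python
import numpy as np

# A structured (record) array holding heterogeneous per-experiment
# results: an int, a float and a string field per record.
results = np.array(
    [(3, 0.71, "ok"), (5, 0.42, "fail"), (2, 0.93, "ok")],
    dtype=[("count", "i4"), ("score", "f8"), ("status", "U8")],
)

print(results["score"])                              # plain float ndarray
print(results[results["status"] == "ok"]["count"])   # -> [3 2]
```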

As a final note, and as there is no formal documentation for the plotter part
(only the API documentation), you can quickly test it with './sciexp2/plotter
-i' (opens an IPython shell with everything imported).

Then, suppose you have various CSV files, each with a header line describing
its columns, and path names of the form 'foo/bar-baz.results':

     find_files("@FOO@/@BAR@-@BAZ@.results")
     extract(default_source, "csv", count="LINE")

     # build a Data with 1 dimension
     data = from_rawdata(default_rawdata)
     print data.ndim, data.dim().expression
     print list(data.dim())

     # reshape to multiple dimensions
     rdata = data.reshape(["FOO"], ["BAR", "BAZ"], ["LINE"])
     print rdata.ndim, rdata.dim(0).expression, rdata.dim(1).expression
     print list(rdata.dim(0))
     print list(rdata.dim(1))

     # now you can start playing with accesses to ticks (as returned by previous
     # prints), lists of those, slices or filters (e.g., rdata[::"FOO ==
     # 'foo1'"])

     # you can also access record fields by means of 'data.name'

     # if you put this in a file, simply execute './sciexp2/plotter -f file',
     # and at the end:
     shell()


apa!

Footnotes: 
[1]  https://projects.gso.ac.upc.edu/projects/sciexp2/repository/entry/trunk/sciexp2/data/__init__.py#L762
[2]  https://projects.gso.ac.upc.edu/projects/sciexp2/repository/entry/trunk/sciexp2/data/__init__.py#L561
[3]  http://jesusabdullah.github.com/2010/07/02/datarray.html

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


