[Numpy-discussion] multidimensional record arrays

Fri Jul 16 13:19:00 EDT 2004

There have been a number of questions and suggestions about
how the record array facility in numarray could be improved.
We've been talking about these internally and thought it would
be useful to air some proposals along with discussions of the
rationale behind each proposal as well as discussions of drawbacks,
and some remaining open questions. Rather than do this in one
long message, we will do this in pieces. The first addresses
how to improve handling multidimensional record arrays.

These messages will not discuss how or when we implement the proposed
enhancements or changes. We first want to come to some
consensus (or, lacking that, a decision) about what the
target should be.

*********************************************************

Proposal for records module enhancement, to handle record arrays of
dimension (rank) higher than 1.

Background:

The current records module in numarray doesn't handle record arrays of
dimension higher than one well.  Even though most of the infrastructure
for higher dimensionality is already in place, the current implementation
for the record arrays was based on the implicit assumption that record
arrays are 1-D. This limitation is reflected in the areas of input user
interface, indexing, and output.

The indexing and output are more straightforward to modify, so I'll
discuss them first.

Although it is possible to create a multi-dimensional record array,
indexing does not work properly for 2 or more dimensions.  For example,
for a 2-D record array r, r[i,j] does not give the correct result (but r[i][j]
does). This will be fixed.
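
To illustrate the intended behavior (this is not the numarray
implementation), here is a plain-Python sketch. NestedRecords is a
hypothetical class backed by nested lists, in which r[i,j] and r[i][j]
return the same record:

```python
class NestedRecords:
    """Hypothetical stand-in for a 2-D record array, backed by nested lists."""

    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, index):
        # r[i, j] arrives as the tuple (i, j); apply one index per axis.
        if isinstance(index, tuple):
            item = self._rows
            for i in index:
                item = item[i]
            return item
        # r[i] returns one row, so r[i][j] then indexes within that row.
        return self._rows[index]


r = NestedRecords([[(1, 'abc'), (2, 'xyz')],
                   [(3, 'def'), (4, 'uvw')]])
print(r[1, 0])   # (3, 'def')
print(r[1][0])   # (3, 'def')
```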

At present, a user cannot print record arrays higher than 1-D.  This will
also be fixed, and printing will incorporate some numarray features (e.g.,
printing only the beginning and end of a large array, as is already done
for numarrays).
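
A rough sketch of that kind of abbreviated output, in plain Python. The
helper name summarize and the edge count of 3 are assumptions made for
this illustration; they are not taken from numarray:

```python
def summarize(records, edge=3):
    """Format a sequence, eliding the middle when it has more than 2*edge items.

    The default edge count of 3 is an arbitrary choice for this sketch.
    """
    if len(records) <= 2 * edge:
        shown = [repr(rec) for rec in records]
    else:
        head = [repr(rec) for rec in records[:edge]]
        tail = [repr(rec) for rec in records[-edge:]]
        shown = head + ['...'] + tail
    return '[' + ',\n '.join(shown) + ']'


records = [(i, 'row%d' % i) for i in range(1000)]
print(summarize(records))   # first three records, '...', last three records
```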

Input Interface:

There are currently several different ways to construct a record array
using the array() function.  These include setting the buffer argument to:

(1) None
(2) File object
(3) String object or appropriate buffer object (i.e., binary data)
(4) a list of records (in the form of sequences),
    for example:  [(1,'abc', 2.3), (2,'xyz', 2.4)]
(5) a list of numarrays/chararrays for each field (e.g., effectively
 'zipping' the arrays into records)
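
For the 1-D case, the relationship between inputs (4) and (5) can be
shown with plain Python lists standing in for numarrays/chararrays (the
variable names are made up for this illustration):

```python
ids = [1, 2]
names = ['abc', 'xyz']
values = [2.3, 2.4]

# Input (5): one sequence per field.
fields = [ids, names, values]

# Equivalent input (4): one tuple per record, obtained by zipping the fields.
records = list(zip(*fields))
print(records)   # [(1, 'abc', 2.3), (2, 'xyz', 2.4)]
```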

The first three types of input are very general and can be used to generate
multi-dimensional record arrays in the current implementation.  All these
options need to specify the "shape" argument.

The input options that do not work for multi-dimensional record arrays now
are the last two.

Option 4 (sequence of 'records')

If a user has a multi-dimensional record array and one or more fields are
also multi-dimensional arrays, using this option is potentially confusing:
it can be ambiguous which part of a nested sequence structure describes the
shape of the record array and which part belongs to the record, since record
elements themselves may be arrays. (Some of the same issues arise for object
arrays.)

As an example:

--> r=rec.array([([1,2],[3,4]),([11,12],[13,14])])

could be interpreted as a 1-D record array, where each cell is an
(num)array:

RecArray[
(array([1, 2]), array([3, 4])),
(array([11, 12]), array([13, 14]))
]

or a 2-D record array, where each cell is just a number:

RecArray(
             [[(1, 2),
              (3, 4)],

             [(11, 12),
              (13, 14)]])

Thus we propose a new argument "rank" (following the convention used in
object arrays) to specify the dimensionality of the output record array.  In
the first example above, rank is 1, and in the second example rank is 2.  If rank
is set to None, the highest possible rank will be assumed (in this example,
2).
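
A minimal sketch of how "rank" could select the record array shape for
the example above, in plain Python. The helper names are hypothetical,
and the sketch assumes the innermost sequence level is always the record
itself and inspects only the first element at each level:

```python
def nesting_depth(obj):
    """Count nested list/tuple levels, following the first element."""
    depth = 0
    while isinstance(obj, (list, tuple)) and len(obj) > 0:
        depth += 1
        obj = obj[0]
    return depth


def record_array_shape(data, rank=None):
    """Shape of the record array for the requested rank (None = highest)."""
    max_rank = nesting_depth(data) - 1   # innermost level is the record
    if rank is None:
        rank = max_rank
    if not 1 <= rank <= max_rank:
        raise ValueError('rank must be between 1 and %d' % max_rank)
    shape = []
    level = data
    for _ in range(rank):
        shape.append(len(level))
        level = level[0]
    return tuple(shape)


data = [([1, 2], [3, 4]), ([11, 12], [13, 14])]
print(record_array_shape(data, rank=1))   # (2,)
print(record_array_shape(data))           # (2, 2)
```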

We propose to eventually generalize this to accept any sequence object for
the array structure (though there will be the same requirement that exists
for other arrays that the nested sequences be of the same type). As would be
expected, strings are not permitted as the enclosing sequence. In this
future implementation the record 'item' itself must be one of:

1) A tuple
2) A subclass of tuple
3) A Record object (this may be taken care of by 2 if we make Record
   a subclass of tuple; this will be discussed in a subsequent proposal).

This requirement allows distinguishing the sequence of records from Option 5
below. For tuples (or tuple derived elements), the items of the tuple must
be one of the following: basic data types such as int, float, boolean, or
string; a numarray or chararray; or an object that can be converted to a
numarray or chararray.
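
One possible reading of the tuple requirement, sketched in plain Python:
descend through the enclosing sequences until the first tuple is reached
and treat that tuple as a record. The function name is hypothetical, the
sketch assumes the enclosing sequences are lists, it checks only the
first element at each level, and it does not model the "rank" argument:

```python
def shape_from_records(data):
    """Descend through enclosing lists until the first tuple (a record)."""
    shape = []
    level = data
    while not isinstance(level, tuple):
        # Strings and other non-list objects are rejected as enclosing sequences.
        if not isinstance(level, list) or not level:
            raise TypeError('expected nested non-empty lists of tuple records')
        shape.append(len(level))
        level = level[0]
    return tuple(shape)


# Tuples at the first level: a 1-D record array whose fields are sequences.
flat = [([1, 2], [3, 4]), ([11, 12], [13, 14])]
# Tuples at the second level: a 2-D record array of two-number records.
grid = [[(1, 2), (3, 4)], [(11, 12), (13, 14)]]

print(shape_from_records(flat))   # (2,)
print(shape_from_records(grid))   # (2, 2)
```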

Option 5 (List of Arrays)

Using a list of arrays to construct an N-D record array should be easier
than using the previous option.  The input syntax is simply:

[array1, array2, array3,...]

The shape of the record array will be determined from the shape of the input
arrays as described below. All the user needs to do is to construct the
arrays in the list.  As with Option 4, there is a possible ambiguity:
if all the arrays have the same shape, say (2,3), then the user may intend a
1-D record array of 2 rows in which each cell is an array of shape (3,), or a
2-D record array of shape (2,3) in which each cell is a single number or
string. Thus, the user must explicitly specify either "shape" or "rank".

We propose the following behavior via examples:

Example 1:

given:

array1.shape=(2,3,4,5)
array2.shape=(2,3,4)
array3.shape=(2,3)

Rank can only be specified as rank=1 (the record array's shape will then be
(2,)) or rank=2 (the record array's shape will then be (2,3)). For rank=None
the record shape will be (2,3), i.e. the "highest common denominator": each
cell in the first field will be an array of shape (4,5), each cell in the
second field will be an array of shape (4,), and each cell in the 3rd field
will be a single number or a string.  If "shape" is specified, it will take
precedence over "rank"; its allowed values in this example are either
2 or (2,3).

Example 2:

array1.shape=(3,4,5)
array2.shape=(4,5)

This will raise an exception because the 'slowest' axes do not match.
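
The rules in Examples 1 and 2 can be sketched in plain Python, working on
shape tuples only. The function record_layout is hypothetical and is not
part of the records module; ValueError is an assumed choice of exception:

```python
def record_layout(shapes, rank=None, shape=None):
    """Return (record_shape, per-field cell shapes) for a list of array shapes."""
    # Longest leading run of axes shared by every input shape.
    common = []
    for dims in zip(*shapes):
        if len(set(dims)) != 1:
            break
        common.append(dims[0])
    if not common:
        raise ValueError("the 'slowest' axes do not match")

    if shape is not None:
        # "shape" takes precedence over "rank"; accept an int or a tuple.
        record_shape = (shape,) if isinstance(shape, int) else tuple(shape)
        if not record_shape or record_shape != tuple(common[:len(record_shape)]):
            raise ValueError('shape is inconsistent with the supplied data')
    else:
        if rank is None:
            rank = len(common)           # highest rank the data allows
        if not 1 <= rank <= len(common):
            raise ValueError('rank must be between 1 and %d' % len(common))
        record_shape = tuple(common[:rank])

    # Whatever axes remain after the record shape form each field's cell.
    n = len(record_shape)
    cells = [tuple(s[n:]) for s in shapes]
    return record_shape, cells


shapes = [(2, 3, 4, 5), (2, 3, 4), (2, 3)]
print(record_layout(shapes))            # ((2, 3), [(4, 5), (4,), ()])
print(record_layout(shapes, rank=1))    # ((2,), [(3, 4, 5), (3, 4), (3,)])
print(record_layout(shapes, shape=2))   # ((2,), [(3, 4, 5), (3, 4), (3,)])
```

Calling record_layout([(3, 4, 5), (4, 5)]) raises ValueError in this
sketch, matching Example 2.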

*********

For both the sequence-of-records and list-of-arrays input options, we
propose the default value for "rank" be None (current default is 1).
This gives consistent behavior with object arrays but does change the
current behavior.

Also, for both cases, specifying a shape inconsistent with the supplied data
will raise an exception.