[Numpy-discussion] Record arrays

Thu Jun 26 16:25:11 EDT 2008

On Thu, Jun 26, 2008 at 15:13, Dan Yamins <dyamins at gmail.com> wrote:
>
> On Thu, Jun 26, 2008 at 3:34 PM, Gael Varoquaux
> <gael.varoquaux at normalesup.org> wrote:
>>
>> On Thu, Jun 26, 2008 at 11:48:06AM -0500, John Hunter wrote:
>> > I personally think they are the best thing since sliced bread, and
>> > everyone here who uses them becomes immediately addicted to them.  I
>> > would like to see better support for them, especially making the attrs
>> > exposed to dir so tab completion would work.
>>
>> > People in the financial/business world work with spreadsheet data a
>> > lot, and record arrays are the natural data structure to represent
>> > tabular, heterogeneous data.    If you work with this data all day,
>> > you save a lot of ugly keystrokes doing r.date rather than r['date'],
>> > and the code is prettier in my opinion.
>>
>> I am +1 on all that.
>>
>
> I also completely second this.  I use them all the time -- for finance data
> as well as biological/genomics data.  It is essential for these applications
> to have spreadsheet-like objects that can have mixed types and from which
> good numpy numerical arrays can be extracted when necessary.   I hope to
> continue having access to them or something like them.  I also hope that
> they will be better documented, since not only do I use them all the time,
> I'm hoping to teach their use to many more people whom I am training in
> spreadsheet-like data analysis.
>
> (If they have some flaw I don't understand, it would be great if someone
> could explain it to me.   And if there's something out there that fixes that
> flaw, I'd love to hear about it.  But it seems to me at least that recarrays
> are very useful.)

Let's be clear: there are two very closely related things, recarrays
and record arrays. Record arrays are just ndarrays with a complicated
dtype. E.g.

In [1]: from numpy import *

In [2]: ones(3, dtype=dtype([('foo', int), ('bar', float)]))
Out[2]:
array([(1, 1.0), (1, 1.0), (1, 1.0)],
      dtype=[('foo', '<i4'), ('bar', '<f8')])

In [3]: r = _

In [4]: r['foo']
Out[4]: array([1, 1, 1])

recarray is a subclass of ndarray that just adds attribute access to
record arrays.

In [10]: r2 = r.view(recarray)

In [11]: r2
Out[11]:
recarray([(1, 1.0), (1, 1.0), (1, 1.0)],
      dtype=[('foo', '<i4'), ('bar', '<f8')])

In [12]: r2.foo
Out[12]: array([1, 1, 1])
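
For anyone who wants to try this outside an interactive session, here is a
minimal sketch of the same steps as a standalone script. It assumes a current
NumPy imported as np rather than the star import used in the session; the
check that the view shares memory with the original is an addition for
illustration and is not part of the session.

```python
import numpy as np

# A record array: a plain ndarray with a compound dtype.
r = np.ones(3, dtype=[('foo', int), ('bar', float)])

# View the same memory as a recarray to gain attribute access.
r2 = r.view(np.recarray)

# Both spellings reach the same field.
assert (r2.foo == r['foo']).all()

# The recarray is a view, not a copy, so a write through one
# is visible through the other.
r2.foo[0] = 42
print(r['foo'][0])
```

Because view() does not copy, switching between the two forms is cheap, so
code can keep the plain ndarray for heavy work and take the recarray view
only where attribute access is convenient.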

One downside of this is that the attribute access feature slows down
all field accesses, even the r['foo'] form, because it sticks a bunch
of pure Python code in the middle. Much code won't notice this, but if
you end up having to iterate over an array of records (as I have),
this will be a hotspot for you.
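
One rough way to check this on your own machine is to time the same
record-by-record loop over both forms. This is only a sketch: the helper
sum_foo, the array size, and the repeat count are made up for illustration,
and the numbers will depend on your NumPy version and hardware.

```python
import timeit

import numpy as np

r = np.ones(1000, dtype=[('foo', int), ('bar', float)])
r2 = r.view(np.recarray)


def sum_foo(arr):
    """Iterate record by record, reading one field from each record."""
    total = 0
    for rec in arr:
        total += rec['foo']
    return total


# Time the identical loop over the plain record array and the recarray view.
t_plain = timeit.timeit(lambda: sum_foo(r), number=20)
t_rec = timeit.timeit(lambda: sum_foo(r2), number=20)

print(f"ndarray:  {t_plain:.4f} s")
print(f"recarray: {t_rec:.4f} s")
```

If the loop turns out to matter, whole-column operations such as
r['foo'].sum() avoid per-record access altogether.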

Record arrays are fundamentally a part of numpy, and no one is even
suggesting that they would go away. No one is seriously suggesting
that we should remove recarray, but some of us hesitate to recommend
its use over plain record arrays.

Does that clarify the discussion for you?

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco