[Numpy-discussion] Can I add rows and columns to recarray?

Mon Dec 6 18:06:54 EST 2010

On Monday 06 December 2010 22:00:29 Wai Yip Tung wrote:
> Thank you for the quick response and Christopher's explanation on the
> design background.
> 
> All my tables fit in-memory. I want to explore the data interactively
> and a relational database does not provide me a lot of value.
> 
> I was rolling my own library before I came to numpy. Then I found
> numpy's universal functions awesome and a really good fit for what I
> want to do. Now I just need to find out how to add a row, which is easy
> in Python. It is OK if it rebuilds the array when I add a column, which
> should happen infrequently. But if adding a row builds a new array,
> this will lead to O(n^2) complexity. In any case, I will explore the
> recfunctions.

If you want a container with a better complexity for adding columns  
than O(n^2), you may want to have a look at the ctable object in the 
carray package:

https://github.com/FrancescAlted/carray

carray is about providing compressed, in-memory data containers for both 
homogeneous (arrays) and heterogeneous data (structured arrays).  Here 
is an example of its use:

>>> import numpy as np
>>> import carray as ca
>>> NR = 1000*1000
>>> r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
>>> new_field = np.arange(NR, dtype='f8')**3
>>> rc = ca.ctable(r)
>>> rc
ctable((1000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 11.44 MB; cbytes: 1.71 MB; ratio: 6.70
[(0, 0), (1, 1), (2, 4), ..., (999997, 999994000009), (999998, 999996000004), (999999, 999998000001)]
>>> time rc.addcol(new_field, "f2")
CPU times: user 0.03 s, sys: 0.00 s, total: 0.03 s
Wall time: 0.03 s

that is, only 30 ms for appending a column.  This is basically the time 
to copy (and compress) the data (i.e. O(n)).  If you append an already 
compressed column, the cost of adding it is O(1):

>>> r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
>>> rc = ca.ctable(r)
>>> cnew_field = ca.carray(np.arange(NR, dtype='f8')**3)
>>> time rc.addcol(cnew_field, "f2")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

On the other hand, using plain structured arrays is considerably more 
costly:

>>> import numpy.lib.recfunctions as nprf
>>> time r2 = nprf.rec_append_fields(r, 'f2', new_field, 'f8')
CPU times: user 0.34 s, sys: 0.02 s, total: 0.36 s
Wall time: 0.36 s
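
For reference, here is a minimal, self-contained sketch of the plain 
NumPy side of this comparison only (it does not use carray).  The size 
n is an assumption chosen to keep it quick, and range replaces xrange 
so that it also runs on Python 3.  It shows that rec_append_fields 
returns a new array and leaves the original untouched, which is why the 
cost grows with the size of the whole table:

```python
import numpy as np
import numpy.lib.recfunctions as nprf

n = 1000  # assumed size, much smaller than NR in the session above

# Same layout as in the session: two integer fields named f0 and f1.
r = np.fromiter(((i, i * i) for i in range(n)), dtype="i4,i8")
new_field = np.arange(n, dtype="f8") ** 3

# A new record array is allocated and every existing field is copied into it.
r2 = nprf.rec_append_fields(r, "f2", new_field, "f8")

print(r.dtype.names)   # the original array still has only f0 and f1
print(r2.dtype.names)  # the copy has the extra f2 field
print(r2[2])
```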

Appending data at the end of ctable objects is also very fast:

>>> timeit rc.append(row)
100000 loops, best of 3: 13.1 µs per loop

Compare this with appending to a structured array:

>>> timeit np.concatenate((r2, row))
100 loops, best of 3: 6.84 ms per loop
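
The reason behind that difference is that np.concatenate builds a whole 
new array on every call, so appending n rows one by one copies O(n^2) 
elements in total.  If you stay with plain NumPy, one common workaround 
(a sketch, not what carray does internally) is to collect the rows in a 
Python list and build the array once at the end.  The helper names and 
the row data below are hypothetical, made up for illustration:

```python
import numpy as np

dtype = np.dtype("i4,i8,f8")  # fields f0, f1, f2, as in the session above


def grow_by_concatenate(rows):
    """Append row by row; each step copies the whole array built so far."""
    out = np.empty(0, dtype=dtype)
    for row in rows:
        out = np.concatenate((out, np.array([row], dtype=dtype)))
    return out


def grow_by_list(rows):
    """Collect rows in a list (amortized O(1) appends), convert once."""
    buf = []
    for row in rows:
        buf.append(row)
    return np.array(buf, dtype=dtype)


rows = [(i, i * i, float(i) ** 3) for i in range(1000)]
a = grow_by_concatenate(rows)
b = grow_by_list(rows)
print(len(a), len(b), a[3], b[3])
```

Both functions produce the same array; only the amount of copying along 
the way differs.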

Unfortunately you cannot do the full range of operations supported by 
structured arrays with ctables, and a ctable object is rather meant to 
be used as an efficient, compressed container for structures in memory:

>>> r2[2]
(2, 4, 8.0)
>>> rc[2]
(2, 4, 8.0)
>>> r2['f1']
array([0, 1, 4, ..., 1, 1, 1])
>>> rc['f1']
carray((1452223,), int64)  nbytes: 11.08 MB; cbytes: 1.62 MB; ratio: 6.85
  cparams := cparams(clevel=5, shuffle=True)
[0, 1, 4, ..., 1, 1, 1]

But you can still do fun things like complex queries:

>>> [r for r in rc.getif("(f0<10)&(f2>4)", ["__nrow__", "f1"])]
[(2, 4),
 (3, 9),
 (4, 16),
 (5, 25),
 (6, 36),
 (7, 49),
 (8, 64),
 (9, 81),
 (1041112, 1)]
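
For comparison, roughly the same query (row number plus the f1 value of 
every matching row) can be sketched with a plain structured array and a 
boolean mask.  This uses NumPy only; the small array built here is an 
assumption for illustration and does not contain the extra rows 
appended in the session above:

```python
import numpy as np

n = 1000  # assumed size, for illustration only

# Structured array with the same three fields as r2 in the session.
r2 = np.empty(n, dtype="i4,i8,f8")
r2["f0"] = np.arange(n)
r2["f1"] = np.arange(n, dtype="i8") ** 2
r2["f2"] = np.arange(n, dtype="f8") ** 3

# Same condition as in the getif() call above.
mask = (r2["f0"] < 10) & (r2["f2"] > 4)

# Row numbers of the matches, paired with their f1 values.
nrows = np.flatnonzero(mask)
result = list(zip(nrows.tolist(), r2["f1"][mask].tolist()))
print(result)
```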

The queries are also very fast (both Numexpr and Blosc are used under 
the hood):

>>> timeit [r for r in rc.getif("(f0<10)&(f2>4)")]
10 loops, best of 3: 58.6 ms per loop
>>> timeit r2[(r2['f0']<10)&(r2['f2']>4)]
10 loops, best of 3: 28 ms per loop

So, queries on ctables are only about 2x slower than on plain structured 
arrays.  Of course, the secret goal is to make this sort of query 
actually faster than with structured arrays :)

I still need to finish the docs, but I plan to release carray 0.3 later 
this week.

Cheers,

-- 
Francesc Alted