[SciPy-user] Handle large array

Francesc Alted faltet at pytables.org
Fri Jan 16 04:34:50 EST 2009


On Friday 16 January 2009, Robert Kern wrote:
> On Fri, Jan 16, 2009 at 02:23, Tan Tran <fragon25 at yahoo.com> wrote:
> > Hello,
> >
> > I'm trying to do some & (element-wise AND) operations like this:
> >
> > xx = (d[:,0:1] == 0) & (d[:,2:3] == 2) & (d[:,1:2] == 1) & (d[:,1:2] == 2)
> >
> > If d is small (19 columns and about 5000 rows), the code runs
> > fine.  But with larger data, say about 40k rows, I get a
> > MemoryError.
> >
> > I tried assigning each comparison to a separate variable, but I
> > still have a problem when trying to & them:
> > aa = d[:,0:1] == 0
> > bb =  d[:,2:3] == 2
> > cc = d[:, 1:2]==1
> > dd = d[:, 1:2]==2
> >
> > xx = aa & bb & cc & dd  <-- MemoryError is raised here
> >
> > Have anybody seen this problem before? How to play with large data?
>
> I usually chunk things up using iterators. For example:
>
>
> import numpy as np
>
> def chunked_slices(ntotal, chunksize):
>     nchunks, nlast = divmod(ntotal, chunksize)
>     for i in range(nchunks):
>         yield slice(i*chunksize, (i+1)*chunksize)
>     if nlast > 0:
>         # use nchunks*chunksize here so this also works when
>         # ntotal < chunksize (the loop above never runs and i
>         # would be undefined)
>         penultimate = nchunks*chunksize
>         yield slice(penultimate, penultimate+nlast)
>
> xx = np.empty([len(d)], dtype=bool)
>
> for slc in chunked_slices(len(d), 1000):
>     xx[slc] = ((d[slc,0] == 0) & (d[slc,2] == 2) &
>                (d[slc,1] == 1) & (d[slc,1] == 2))

Another option could be to use numexpr [1], which avoids the creation 
of temporaries during expression evaluation:

xx = numexpr.evaluate("aa & bb & cc & dd")
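
For reference, here is a self-contained sketch of the same idea (the 
toy data is just an assumption for illustration; any integer array 
works):

import numpy as np
import numexpr

# Stand-in for your ~40k x 19 array (illustrative values only)
d = np.random.randint(0, 3, size=(40000, 19)).astype(np.int32)

aa = d[:,0] == 0
bb = d[:,2] == 2
cc = d[:,1] == 1
dd = d[:,1] == 2

# numexpr compiles the boolean expression and evaluates it in one
# pass over the operands, so no full-length temporaries are created
xx = numexpr.evaluate("aa & bb & cc & dd")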

However, I think that your real problem here is that your initial 
array, d, is too large and takes up almost all of your available 
memory.  You may want to save it to a file and read columns from it 
when you need them.  There are several ways to achieve this, such as 
memmapped arrays or HDF5/NetCDF4 files.  Here is a quick example 
following the HDF5 path (through PyTables [2]):

In [1]: import numpy as np

In [2]: import tables as tb

In [3]: import tables.numexpr as ne

In [4]: f = tb.openFile('mydata.h5', 'w')

In [5]: d = f.createCArray(f.root, 'mydata', tb.Int32Atom(), (19,40000))

In [6]: for ncol in range(19):  # Write data column by column
   ...:     d[ncol] = np.arange(40000)*ncol
   ...:

In [7]: a, b, c = d[0,:], d[1,:], d[2,:]

In [8]: xx = ne.evaluate('(a == 0) & (c == 2) & (b == 1) & (b == 2)')

In [9]: xx
Out[9]: array([False, False, False, ..., False, False, False], dtype=bool)

In [10]: f.close()


With this, you will have at most 4 columns (a, b, c and xx) of your 
data in memory, while the d array stays completely on disk.  Note that 
I've transposed your original d array for read-efficiency reasons.  
Also, numexpr is already integrated into PyTables, so you don't need 
to install it separately (although you can if you want).
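
If you'd rather follow the memmap route, here is a minimal sketch (the 
file name and fill values are just placeholders for illustration):

import numpy as np

# Disk-backed array; the OS pages in only the parts you touch
d = np.memmap('mydata.dat', dtype=np.int32, mode='w+',
              shape=(19, 40000))

for ncol in range(19):   # each row holds one column of the original data
    d[ncol] = np.arange(40000)*ncol

# Slices of a memmap are views backed by the file, so only the three
# rows we actually use are read into memory
a, b, c = d[0,:], d[1,:], d[2,:]
xx = (a == 0) & (c == 2) & (b == 1) & (b == 2)

del d   # flushes pending changes to disk and drops the mapping

The same transposed layout is used here, for the same read-efficiency 
reason as in the PyTables example above.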

[1] http://code.google.com/p/numexpr/
[2] http://www.pytables.org

Hope that helps,

-- 
Francesc Alted


