[SciPy-user] Handle large array
Francesc Alted
faltet at pytables.org
Fri Jan 16 04:34:50 EST 2009
A Friday 16 January 2009, Robert Kern escrigué:
> On Fri, Jan 16, 2009 at 02:23, Tan Tran <fragon25 at yahoo.com> wrote:
> > Hello,
> >
> > I'm trying to do some & like this
> >
> > xx = (d[:,0:1] == 0) & (d[:,2:3] == 2) & (d[:, 1:2]==1) & (d[:,
> > 1:2]==2)
> >
> > If d is small, 19 columns and about 5000 rows, the code runs fine.
> > But if I have large data like d has about 40k rows, I got error
> > message: MemoryError
> >
> > I tried to make separate variable but still have problem when
> > trying to & them
> > aa = d[:,0:1] == 0
> > bb = d[:,2:3] == 2
> > cc = d[:, 1:2]==1
> > dd = d[:, 1:2]==2
> >
> > xx = aa & bb & cc & dd <-- MemoryError's here
> >
> > Have anybody seen this problem before? How to play with large data?
>
> I usually chunk things up using iterators. For example:
>
>
> def chunked_slices(ntotal, chunksize):
>     nchunks, nlast = divmod(ntotal, chunksize)
>     for i in range(nchunks):
>         yield slice(i*chunksize, (i+1)*chunksize)
>     if nlast > 0:
>         penultimate = nchunks*chunksize
>         yield slice(penultimate, penultimate+nlast)
>
> xx = np.empty([len(d)], dtype=bool)
>
> for slc in chunked_slices(len(d), 1000):
>     xx[slc] = ((d[slc,0] == 0) & (d[slc,2] == 2) &
>                (d[slc,1] == 1) & (d[slc,1] == 2))
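A self-contained sketch of the chunked approach above (with a hypothetical row count and chunk size; the last-chunk start is computed from nchunks so the helper also works when ntotal < chunksize):

```python
import numpy as np

def chunked_slices(ntotal, chunksize):
    # Yield slices covering range(ntotal) in pieces of at most chunksize rows
    nchunks, nlast = divmod(ntotal, chunksize)
    for i in range(nchunks):
        yield slice(i * chunksize, (i + 1) * chunksize)
    if nlast > 0:
        start = nchunks * chunksize
        yield slice(start, start + nlast)

# Check that the slices cover every row exactly once (hypothetical sizes)
covered = []
for slc in chunked_slices(4500, 1000):
    covered.extend(range(slc.start, slc.stop))
```

Processing 1000 rows at a time keeps each boolean temporary at chunk size rather than full-array size, which is what avoids the MemoryError.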
Another option could be using numexpr [1], which avoids the use of
temporaries during expression evaluation:
xx = numexpr.evaluate("aa & bb & cc & dd")
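For instance (a minimal sketch with random stand-in data for d; aa through dd are the same boolean columns as in your example, and numexpr evaluates the whole expression in one pass instead of materialising intermediate boolean arrays):

```python
import numpy as np
import numexpr

# Hypothetical stand-in for the original d array (40000 rows, 19 columns)
rng = np.random.default_rng(0)
d = rng.integers(0, 3, size=(40000, 19))

aa = d[:, 0] == 0
bb = d[:, 2] == 2
cc = d[:, 1] == 1
dd = d[:, 1] == 2

# Single pass over the operands; no chain of boolean temporaries
xx = numexpr.evaluate("aa & bb & cc & dd")
```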
However, I think that your problem here is that your initial array, d,
is too large and takes up almost all of your available memory. You may
want to save it to a file and read columns from it as you need them.
There are several ways to achieve this, such as memory-mapped arrays or
saving to HDF5/NetCDF4. Here is a quick example following the HDF5
path (through PyTables [2]):
In [1]: import numpy as np
In [2]: import tables as tb
In [3]: import tables.numexpr as ne
In [4]: f = tb.openFile('mydata.h5', 'w')
In [5]: d = f.createCArray(f.root, 'mydata', tb.Int32Atom(), (19,40000))
In [6]: for ncol in range(19):  # Write data column by column
   ...:     d[ncol] = np.arange(40000)*ncol
   ...:
In [7]: a, b, c = d[0,:], d[1,:], d[2,:]
In [8]: xx = ne.evaluate('(a == 0) & (c == 2) & (b == 1) & (b == 2)')
In [9]: xx
Out[9]: array([False, False, False, ..., False, False, False],
dtype=bool)
In [10]: f.close()
With this, you will have at most 4 columns (a, b, c and xx) of your
data in memory, while the d array stays entirely on disk. Note that
I've transposed your original d array for read-efficiency reasons.
Also, numexpr is already integrated into PyTables, so you don't need to
install it separately (although you can if you want).
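The memory-mapped route mentioned above can be sketched in plain NumPy, without PyTables (the file name and data are hypothetical; as in the HDF5 example, the array is stored transposed so each column of the original data is contiguous on disk):

```python
import os
import tempfile
import numpy as np

# Hypothetical on-disk array standing in for d (19 x 40000, transposed)
fname = os.path.join(tempfile.mkdtemp(), 'mydata.dat')
d = np.memmap(fname, dtype=np.int32, mode='w+', shape=(19, 40000))
for ncol in range(19):           # Write data column by column
    d[ncol] = np.arange(40000) * ncol
d.flush()
del d

# Re-open read-only; only the rows (original columns) touched are read in
d2 = np.memmap(fname, dtype=np.int32, mode='r', shape=(19, 40000))
a, b, c = d2[0], d2[1], d2[2]
xx = (a == 0) & (c == 2) & (b == 1) & (b == 2)
```

As with the PyTables version, only the three selected columns and the boolean result live in memory at once; the operating system pages the rest in and out on demand.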
[1] http://code.google.com/p/numexpr/
[2] http://www.pytables.org
Hope that helps,
--
Francesc Alted