[Numpy-discussion] Loading a > GB file into array

Kurt Smith kwmsmith at gmail.com
Fri Nov 30 13:00:36 EST 2007


On Nov 30, 2007 2:47 AM, Martin Spacek <numpy at mspacek.mm.st> wrote:
> I need to load a 1.3GB binary file entirely into a single numpy.uint8
> array. I've been using numpy.fromfile(), but for files > 1.2GB on my
> win32 machine, I get a memory error. Actually, since I have several
> other python modules imported at the same time, including pygame, I get
> a "pygame parachute" and a segfault that dumps me out of python:
>
> data = numpy.fromfile(f, numpy.uint8) # where f is the open file
>
> 1382400000 items requested but only 0 read
> Fatal Python error: (pygame parachute) Segmentation Fault

You might try numpy.memmap -- others have had success with it for
large files (a 32-bit process should be able to map a 1.3 GB file, AFAIK).

See for example:

http://www.thescripts.com/forum/thread654599.html
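
A minimal sketch of the memmap idea, not tested against your setup. The
file name and the frame dimensions (nframes, height, width) are made up for
illustration, and a small dummy file is written first so the snippet runs
on its own. Note that a mapping still needs address space for the whole
file, so whether this gets past the contiguous-allocation problem in a
32-bit process is something you would have to try.

```python
import os
import tempfile

import numpy as np

# Hypothetical movie dimensions and file name, for illustration only.
nframes, height, width = 10, 4, 6
fname = os.path.join(tempfile.mkdtemp(), "movie.bin")

# Write a small dummy uint8 file so the sketch is self-contained.
np.arange(nframes * height * width, dtype=np.uint8).tofile(fname)

# Map the file rather than reading it; data is paged in on access.
movie = np.memmap(fname, dtype=np.uint8, mode="r",
                  shape=(nframes, height, width))

# Indexing over the 0th dimension gives one (height, width) frame.
frame = movie[3]
```

With mode="r" the mapping is read-only, which seems to match the use case of
only indexing frames later on.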

Kurt


>
> If I stick to just doing it at the interpreter with only numpy imported,
> I can open up files that are roughly 100MB bigger, but any more than
> that and I get a clean MemoryError. This machine has 2GB of RAM. I've
> tried setting the /3GB switch on winxp bootup, as well as all the
> registry suggestions at
> http://www.msfn.org/board/storage-process-command-t62001.html. No luck.
> I get the same error in (32bit) ubuntu for a sufficiently big file.
>
> I find that if I load the file in two pieces into two arrays, say 1GB
> and 0.3GB respectively, I can avoid the memory error. So it seems that
> it's not that windows can't allocate the memory, just that it can't
> allocate enough contiguous memory. I'm OK with this, but for indexing
> convenience, I'd like to be able to treat the two arrays as if they were
> one. Specifically, this file is movie data, and the array I'd like to
> get out of this is of shape (nframes, height, width). Right now I'm
> getting two arrays that are something like (0.8*nframes, height, width)
> and (0.2*nframes, height, width). Later in the code, I only need to
> index over the 0th dimension, i.e. the frame index.
>
> I'd like to access all the data using a single range of frame indices.
> Is there any way to combine these two arrays into what looks like a
> single array, without having to do any copying within memory? I've tried
> using numpy.concatenate(), but that gives me a MemoryError because, I
> presume, it's doing a copy. Would it be better to load the file one
> frame at a time, generating nframes arrays of shape (height, width), and
> sticking them consecutively in a python list?
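
If memmap does not work out, one way to get a single range of frame
indices over separately loaded pieces without copying is a thin wrapper
that maps a global frame index to (piece, local index). This is only a
sketch: the class name ChunkedFrames is hypothetical, it handles integer
indices only (no slices), and the two small arrays below stand in for the
1GB and 0.3GB pieces.

```python
import numpy as np


class ChunkedFrames:
    """Index several (n_i, height, width) arrays as one frame sequence."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # offsets[k] is the global index of the first frame in chunk k.
        self.offsets = np.cumsum([0] + [len(c) for c in self.chunks])

    def __len__(self):
        return int(self.offsets[-1])

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("frame index out of range")
        # Find the chunk whose index range contains frame i.
        k = int(np.searchsorted(self.offsets, i, side="right")) - 1
        # Basic indexing returns a view, so no frame data is copied.
        return self.chunks[k][i - self.offsets[k]]


# Stand-ins for the two pieces loaded separately from the file.
a = np.zeros((8, 2, 3), dtype=np.uint8)
b = np.ones((2, 2, 3), dtype=np.uint8)
frames = ChunkedFrames([a, b])

first_of_b = frames[8]  # view onto b[0]
```

A plain python list of nframes arrays would give the same indexing, at the
cost of one small array object per frame.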
>
> I'm using numpy 1.0.4 (compiled from source tarball with Intel's MKL
> library) on python 2.5.1 in winxp.
>
> Thanks for any advice,
>
> Martin
>


