[Numpy-discussion] reading gzip compressed files using numpy.fromfile

Wed Oct 28 15:33:11 EDT 2009

On Wed, Oct 28, 2009 at 14:31, Peter Schmidtke <pschmidtke at mmb.pcb.ub.es> wrote:
> Dear Numpy Mailing List Readers,
>
> I have a fairly simple problem for which I have not found a solution so far.
> I have a gzipped file that has some numbers stored in it, and I want to
> read them into a numpy array as fast as possible, but only one batch of
> data at a time.
> So I would like to use numpy's fromfile function.
>
> For now I have roughly the following code:
>
>     import gzip
>     import numpy as npy
>     f = gzip.open("myfile.gz", "rb")
>     xyz = npy.fromfile(f, dtype="float32", count=400)
>
>
> So I would read 400 entries from the file, keep it open, process my data,
> come back and read the next 400 entries. If I do this, numpy complains
> that the file handle f is not a normal file handle:
> IOError: first argument must be an open file
>
> but in fact it is a zlib file handle. But gzip gives access to the normal
> filehandle through f.fileobj.

np.fromfile() requires a true file object, not just a file-like
object. np.fromfile() works by grabbing the FILE* pointer underneath
and using C stdio calls to read the data, not by calling the .read()
method.

> So I tried xyz=npy.fromfile(f.fileobj,dtype="float32",count=400)
>
> But there I get just meaningless values (not the actual data) and when I
> specify the sep=" " argument for npy.fromfile I get just .1 and nothing
> else.

f.fileobj is the underlying file holding the raw gzip stream, so this
reads the compressed bytes, not the decompressed data that you want.
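
To see the difference, here is a small sketch that writes a few float32
values to a scratch gzip file and then reads it back both ways. The file
name demo.gz and the sample values are made up for illustration. The raw
file starts with the gzip magic bytes rather than with float data, and
only reading through the gzip object returns the original numbers.

```python
import gzip
import os
import tempfile

import numpy as np

values = np.arange(4, dtype="float32")

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(path, "wb") as g:
    g.write(values.tobytes())

# Reading the underlying file directly returns the compressed stream,
# which begins with the two-byte gzip magic number.
with open(path, "rb") as raw:
    header = raw.read(2)

# Reading through the gzip object returns the decompressed bytes.
with gzip.open(path, "rb") as f:
    decoded = np.frombuffer(f.read(), dtype="float32")

print(header == b"\x1f\x8b")
print(decoded)
```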

> Can you tell me why and how to fix this problem? I know that I could read
> everything to memory, but these files are rather big, so I simply have to
> avoid this.

Read in reasonably-sized chunks of bytes at a time, and use
np.fromstring() to create arrays from them.
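
A minimal sketch of that approach, assuming the file holds nothing but raw
float32 values with no header. The helper name read_chunks and its
parameters are made up for illustration. It uses np.frombuffer(), which
current NumPy releases suggest in place of np.fromstring() for binary data.

```python
import gzip

import numpy as np


def read_chunks(path, dtype="float32", count=400):
    """Yield arrays of up to `count` items from a gzip-compressed file.

    Hypothetical helper: assumes the decompressed stream is a flat
    sequence of binary values of the given dtype.
    """
    dtype = np.dtype(dtype)
    nbytes = count * dtype.itemsize
    with gzip.open(path, "rb") as f:
        while True:
            buf = f.read(nbytes)  # decompressed bytes, short only at EOF
            # Ignore a trailing fragment smaller than one item.
            usable = len(buf) - (len(buf) % dtype.itemsize)
            if usable == 0:
                break
            # The resulting array is a read-only view of the bytes;
            # call .copy() on it if the chunk must be modified in place.
            yield np.frombuffer(buf[:usable], dtype=dtype)
```

Each iteration decompresses only one chunk, so memory use stays at roughly
count * itemsize bytes no matter how big the file is. Usage would look
something like "for xyz in read_chunks('myfile.gz'): process(xyz)", where
process stands in for whatever you do with each batch.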

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco