Skipping bytes while reading a binary file?

Lionel lionel.keene at gmail.com
Thu Feb 5 17:22:35 EST 2009


Hello,
I have data stored in binary files. Some of these files are
huge...upwards of 2 gigs or more. They consist of 32-bit float complex
numbers, where the first 32 bits of the file are the real component of
the first number, the second 32 bits are its imaginary component, the
third 32 bits are the real component of the second number, and so on.
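
For concreteness, the first sample can be unpacked with the struct
module like so (the little-endian '<' byte order and the 'data.bin'
filename are just assumptions for illustration):

    import struct

    with open('data.bin', 'rb') as fh:
        # Each sample is two 32-bit floats: (real, imaginary).
        real, imag = struct.unpack('<ff', fh.read(8))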

I'd like to be able to read in just the real components and load them
into a numpy.ndarray, then load the imaginary components into a second
numpy.ndarray. I need the real and imaginary components stored in
separate arrays; they cannot be in a single array of complex numbers
except temporarily. I'm trying to avoid even temporary storage,
though, because of the size of the files.

I'm currently reading the file scanline-by-scanline to extract rows of
complex numbers, which I then loop over and load into the real/
imaginary arrays as follows:


    import array
    import numpy

    self._realData = numpy.empty((Rows, Columns), dtype=numpy.float32)
    self._imaginaryData = numpy.empty((Rows, Columns), dtype=numpy.float32)

    for CurrentRow in range(Rows):
        # Re-create the buffer each row: array.fromfile() appends rather
        # than overwrites, so a fresh array keeps the indices below at 0.
        floatData = array.array('f')
        floatData.fromfile(DataFH, Columns * 2)

        position = 0
        for CurrentColumn in range(Columns):
            self._realData[CurrentRow, CurrentColumn] = floatData[position]
            self._imaginaryData[CurrentRow, CurrentColumn] = floatData[position + 1]
            position = position + 2


The above code works but is much too slow. If I comment out the body
of the "for CurrentColumn in range(Columns)" loop, the performance is
perfectly adequate, i.e. the function-call overhead associated with
the "fromfile(...)" call is not bad at all. What seems to be most
time-consuming is the pair of simple assignment statements in the
"CurrentColumn" for-loop.

Does anyone see a way of speeding this up? Reading everything into a
complex64 ndarray in one fell swoop would certainly be easier and
faster, but at some point I'd need to split that array into two parts
(real / imaginary). I'd like to have the split done up front to keep
the memory usage down, since the files are so ginormous.
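
For reference, the one-fell-swoop version I have in mind looks
something like this; if I understand correctly, .real and .imag are
views into the interleaved buffer rather than copies, so the full
interleaved block stays resident, which is exactly what I'm trying to
avoid:

    import numpy

    # Read the whole file as interleaved complex64 samples.
    data = numpy.fromfile(DataFH, dtype=numpy.complex64).reshape(Rows, Columns)
    realPart = data.real         # view, not a copy
    imaginaryPart = data.imag    # view, not a copy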

Psyco is out because I need 64-bit support, and I didn't see anything
on the forums about a method that reads every other 32-bit chunk from
a file into an array. I'm not sure what else to try.
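
The closest thing to "every other 32-bit chunk" I can think of is a
structured dtype, possibly combined with numpy.memmap so the OS pages
the file in as needed. This is an untested sketch, with 'data.bin'
standing in for the real filename:

    import numpy

    # One record per complex sample: a real float32 followed by an
    # imaginary float32, matching the interleaved layout on disk.
    sampleType = numpy.dtype([('re', numpy.float32), ('im', numpy.float32)])

    mapped = numpy.memmap('data.bin', dtype=sampleType, mode='r',
                          shape=(Rows, Columns))
    realData = numpy.array(mapped['re'])         # copies only the real parts
    imaginaryData = numpy.array(mapped['im'])    # copies only the imaginary parts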

Thanks in advance.
L


