Skipping bytes while reading a binary file?

Thu Feb 5 17:48:49 EST 2009

Lionel wrote:
 > Hello,
 > I have data stored in binary files. Some of these files are
 > huge, 2 GB or more. They consist of 32-bit float complex
 > numbers, where the first 32 bits of the file are the real component, the
 > second 32 bits are the imaginary component, the third 32 bits are the
 > real component of the second number, etc.
 >
 > I'd like to be able to read in just the real components, load them
 > into a numpy.ndarray, then read the imaginary components and load them
 > into a numpy.ndarray.  I need the real and imaginary components stored
 > in separate arrays; they cannot be in a single array of complex
 > numbers except temporarily. I'm trying to avoid temporary storage,
 > though, because of the size of the files.
 >
 > I'm currently reading the file scanline-by-scanline to extract rows of
 > complex numbers which I then loop over and load into the real/
 > imaginary arrays as follows:
 >
 >
 >         self._realData      = numpy.empty((Rows, Columns), dtype=numpy.float32)
 >         self._imaginaryData = numpy.empty((Rows, Columns), dtype=numpy.float32)
 >
 >         for CurrentRow in range(Rows):
 >
 >             # array.fromfile() appends to the array, so start each row
 >             # with an empty one; otherwise index 0 is always row 0.
 >             floatData = array.array('f')
 >             floatData.fromfile(DataFH, (Columns*2))
 >
 >             position = 0
 >             for CurrentColumn in range(Columns):
 >
 >                 self._realData[CurrentRow, CurrentColumn]      = floatData[position]
 >                 self._imaginaryData[CurrentRow, CurrentColumn] = floatData[position+1]
 >                 position = position + 2
 >
 >
 > The above code works but is much too slow. If I comment out the body
 > of the "for CurrentColumn in range(Columns)" loop, the performance is
 > perfectly adequate i.e. function call overhead associated with the
 > "fromfile(...)" call is not very bad at all. What seems to be most
 > time-consuming are the simple assignment statements in the
 > "CurrentColumn" for-loop.
 >
[snip]
Try array slicing. floatData[0::2] will return the real parts and
floatData[1::2] will return the imaginary parts. You'll have to read up
on how to assign to a slice of the numpy array (it might be
"self._realData[CurrentRow] = real_parts" or "self._realData[CurrentRow,
:] = real_parts").
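
For illustration, here is a minimal sketch of the slicing idea applied one
row at a time. The function name read_split, its parameters, and the tiny
demo file are made up for this example and are not from the original code.
It assumes the file holds native-endian 32-bit floats and that each row is
Columns interleaved (real, imaginary) pairs. A fresh array.array is created
for every row because fromfile() appends to the array it is called on.

```python
import array
import os
import tempfile

import numpy


def read_split(path, rows, columns):
    """Read interleaved float32 (real, imag) pairs into two 2-D arrays."""
    real_data = numpy.empty((rows, columns), dtype=numpy.float32)
    imaginary_data = numpy.empty((rows, columns), dtype=numpy.float32)
    with open(path, "rb") as data_fh:
        for current_row in range(rows):
            # fromfile() appends, so use an empty buffer for each row.
            float_data = array.array("f")
            float_data.fromfile(data_fh, columns * 2)
            # Even indices are the real parts, odd indices the imaginary parts.
            real_data[current_row, :] = float_data[0::2]
            imaginary_data[current_row, :] = float_data[1::2]
    return real_data, imaginary_data


# Demo on a small throwaway file: 2 rows x 3 columns of (real, imag) pairs.
demo_values = array.array("f", range(1, 13))
demo_fd, demo_path = tempfile.mkstemp()
with os.fdopen(demo_fd, "wb") as demo_fh:
    demo_values.tofile(demo_fh)

demo_real, demo_imag = read_split(demo_path, 2, 3)
os.remove(demo_path)
print(demo_real)
print(demo_imag)
```

Only one row of interleaved data is held in memory at a time, so the
temporary storage stays at Columns*2 floats regardless of the file size.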

BTW, it's not the function call overhead of fromfile() which takes the
time, but actually reading data from the file.