[Numpy-discussion] striding through arbitrarily large files

RayS rays at blue-cove.com
Tue Feb 4 10:01:48 EST 2014


I was struggling with methods of reading large disk files into numpy
efficiently (not FITS or .npy, just raw files of IEEE floats from
numpy.tostring()). When loading arbitrarily large files it would be
nice not to read any more than the plot can display before
zooming in. There apparently are no built-in methods that allow
skipping/striding...

With a 2 GB file, I want to read n (e.g. 4096) evenly spaced samples out of it.
I tried making a dtype, and other tricks, to read it "Pythonically", but
failed, so I broke down and used a for loop with f_obj.seek() and fromfile().
The file will be open()ed once, but the data will be read many times.

import numpy

num_channels = 9
desired_len = 4096
bytes_per_val = numpy.dtype(numpy.float32).itemsize

f_obj = open(path, 'rb')                      # path to the raw float32 file
f_obj.seek(0, 2)                              # jump to EOF to measure the file
file_length = f_obj.tell()
f_obj.seek(0, 0)
bytes_per_smp = num_channels * bytes_per_val
num_samples = file_length // bytes_per_smp    # whole samples in the file
stride_smps = num_samples // desired_len      # an int: samples per stride
stride_bytes = stride_smps * bytes_per_smp    # byte offset between picks

arr = numpy.zeros((desired_len, num_channels), dtype=numpy.float32)

for i in range(desired_len):
    f_obj.seek(i * stride_bytes, 0)           # absolute seek to the i-th pick
    arr[i] = numpy.fromfile(f_obj, dtype=numpy.float32, count=num_channels)
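
That does get each refresh down to desired_len * num_channels *
bytes_per_val = 4096 * 9 * 4 bytes, about 144 KiB actually read,
instead of the whole 2 GB.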

So, is there a better way to move the pointer through the file 
without a for loop?
A generator?
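
Something like this is what I had in mind for the generator (a sketch
reusing the variables above; it really just hides the same loop):

def strided_samples(f, count, stride_bytes, num_channels):
    # yield one num_channels-wide float32 sample every stride_bytes
    for i in range(count):
        f.seek(i * stride_bytes, 0)   # absolute seek, as in the loop above
        yield numpy.fromfile(f, dtype=numpy.float32, count=num_channels)

arr = numpy.vstack(list(strided_samples(f_obj, desired_len,
                                        stride_bytes, num_channels)))

And for a truly loop-free read, maybe a single fromfile() with a
structured dtype that treats the skipped bytes as void padding; a
sketch I haven't verified (assumes stride_smps > 1 so the pad field
is non-empty):

pad_bytes = (stride_smps - 1) * bytes_per_smp
padded = numpy.dtype([('sample', numpy.float32, (num_channels,)),
                      ('pad', 'V%d' % pad_bytes)])   # raw bytes, never decoded
f_obj.seek(0, 0)
recs = numpy.fromfile(f_obj, dtype=padded, count=desired_len)
arr = recs['sample']    # shape (desired_len, num_channels)

Since stride_smps = num_samples // desired_len, the desired_len padded
records should all fit before EOF.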

The dtype and other methods like mmap fail with a MemoryError, since
they still try to load/map the whole file into address space. Apparently
you can mmap a file this size on a 64-bit system, which I might try soon
with a new 64-bit install.
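
If the 64-bit install pans out, I'd expect the numpy.memmap version to
look roughly like this (a sketch, untested here, reusing the sizes
computed above):

mm = numpy.memmap(path, dtype=numpy.float32, mode='r')   # maps; nothing read yet
usable = num_samples * num_channels          # drop any trailing partial sample
data = mm[:usable].reshape(num_samples, num_channels)
arr = numpy.array(data[::stride_smps][:desired_len])     # copy only the picked rows

The OS would then page in only the blocks those rows touch, not the
whole 2 GB.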

- Ray
