[AstroPy] striding through arbitrarily large files

Erik Bray embray at stsci.edu
Mon Feb 3 14:16:32 EST 2014


Indeed, normally I would suggest just using mmap, but if you really have to 
support 32-bit systems that won't help you as much as one might like.  As it is, 
what you are doing is mostly what you need to do for Numpy.  The one thing you 
thankfully don't need to do is read a string from the file and then use 
numpy.fromstring.
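
For reference, where a 64-bit system is available, the mmap route might look 
roughly like the sketch below.  It assumes the layout from the original message 
(interleaved float32 values, 9 channels per sample); the function name 
sample_with_memmap and its parameters are made up for illustration.

```python
import numpy as np

def sample_with_memmap(path, num_channels=9, desired_len=4096):
    # Hypothetical helper: map the file read-only; data is paged in lazily.
    flat = np.memmap(path, dtype=np.float32, mode='r')
    num_samples = flat.size // num_channels
    # Drop any trailing partial sample, then view as (samples, channels).
    data = flat[:num_samples * num_channels].reshape(num_samples, num_channels)
    stride = max(num_samples // desired_len, 1)
    # Only the pages holding the selected rows need to be read from disk;
    # np.array copies those rows into an ordinary in-memory array.
    return np.array(data[::stride][:desired_len])
```

On a 32-bit system the mapping itself can fail for a large file, which is why 
the seek-based approach is still needed there.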

I actually only just discovered this myself, as it is undocumented, and have yet 
to fix it in PyFITS.  But if you pass numpy.fromfile an open file object it will 
read starting from wherever the file pointer is positioned, rather than the 
beginning of the file.

So you can just:

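import numpy as np
f = open('array.raw', 'rb')
# offset is counted in float32 elements, so convert it to bytes
f.seek(np.dtype('float32').itemsize * offset)
# reads from the current position to end of file; pass count=N to stop after N items
section = np.fromfile(f, dtype='float32')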

So that at least saves you from having to perform the read() first.
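
Applied to the strided read from the original message, that might look 
something like the sketch below.  It still uses a for loop, but each iteration 
is just a seek plus a fromfile call with a count.  The function name 
read_strided and its defaults are assumptions for illustration, and this only 
works with a real on-disk file object, not with file-like objects such as 
BytesIO.

```python
import numpy as np

def read_strided(path, num_channels=9, desired_len=4096, dtype=np.float32):
    """Read up to desired_len evenly spaced samples of num_channels values each."""
    dt = np.dtype(dtype)
    bytes_per_smp = num_channels * dt.itemsize
    with open(path, 'rb') as f:
        f.seek(0, 2)  # jump to the end to measure the file
        num_samples = f.tell() // bytes_per_smp
        n = min(desired_len, num_samples)
        stride_smps = max(num_samples // desired_len, 1)
        arr = np.empty((n, num_channels), dtype=dt)
        for i in range(n):
            f.seek(i * stride_smps * bytes_per_smp)
            # fromfile starts at the current file position; count limits
            # the read to a single sample instead of the rest of the file.
            arr[i] = np.fromfile(f, dtype=dt, count=num_channels)
    return arr
```

Calling read_strided('array.raw') would then return an array of at most 
4096 rows by 9 columns without loading the whole file.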

Erik

On 02/01/2014 05:50 PM, RayS wrote:
> I hope this isn't too off-topic for astro, but I know many here work with huge
> files.
>
> I was struggling yesterday with methods of reading large disk files into numpy
> efficiently (not FITS, just raw files of IEEE floats). When loading arbitrarily
> large files it would be nice to not bother reading more than the plot can
> display before zooming in.
>
> With a 2GB file, I want to read n (like 4096) evenly sampled points out of it.
> I tried making a dtype, and other tricks, to read "Pythonically", but failed. I
> broke down and used a for loop with fh.seek() and fromstring():
>
> import numpy
> num_channels = 9
> desired_len = 4096
> bytes_per_val = numpy.dtype(numpy.float32).itemsize
> f_obj = open(path, 'rb')
> f_obj.seek(0,2)
> file_length = f_obj.tell()
> f_obj.seek(0,0)
> bytes_per_smp = num_channels * bytes_per_val
> num_samples = file_length // bytes_per_smp
> stride_smps = num_samples // desired_len ## an int
> stride_bytes = stride_smps * bytes_per_smp
>
> arr = numpy.zeros((desired_len, num_channels))
> for i in range(0, desired_len, 1):
>      f_obj.seek(i*stride_bytes, 0)
>      arr[i] = numpy.fromstring(f_obj.read(bytes_per_smp), dtype='f4', count=num_channels)
>
> So, is there a better way to move the pointer through the file without a for loop?
> Would a generator be much faster?
>
> The dtype and other methods like mmap fail with MemoryError, although apparently
> you can mmap on 64-bit systems.
>
> - Ray