[AstroPy] reading one line from many small fits files

Tue Jul 31 11:52:07 EDT 2012

On 07/30/2012 08:57 PM, Derek Homeier wrote:
> Hi John,
>
> On 31.07.2012, at 1:40AM, "John K. Parejko"<john.parejko at yale.edu>  wrote:
>
>> This is really more of a pyfits question, but I've upgraded to pyfits 3.1 (SVN), which is the version in astropy.
>>
>> I have data stored in thousands of ~few MB .fits files (photoObj files from SDSS) totaling a few TB of data, and I know the one single line I want to extract from some known subset of those files. But pyfits is taking more than a second per file to extract the fields I want, which seems very long, especially if it is using memmapped access, and thus should only have to read that single line (plus the header) from each file.
>>
>> I'm doing something like this:
>>
>>     result = np.empty(len(data), dtype=dtype)
>>     for i, x in enumerate(data):
>>         filename = getfilename(x[somefield])
>>         photo = pyfits.open(filename, memmap=True)
>>         result[i] = photo[1].data[x[otherfield] - 1]
>>
>> Is there a better way to go about this? Is pyfits known to be quite slow when reading a single row from a lot of different files? Anyone have suggestions on how to speed this up?
>
> that seems quite slow; it takes me about 50 ms to read a random line from the DR8 example file
> with pyfits 3.0.2. Unless the file access itself takes so long something appears to be odd.
> But the only thing coming to my mind now is that pyfits supports scaled column data (similar to
> BSCALE/BZERO in image HDUs, I assume), and if such keywords were present, they would probably
> cause a corresponding transformation for the entire bintable. They don't seem to exist in the standard
> SDSS files, though.
> Naïve question: do you call photo.close() after each read?
>
> Cheers,
> 						Derek

It probably shouldn't matter whether or not he's calling close(), but
the question about BSCALE/BZERO is possibly relevant.  Is the data
you're reading from an image or a table?  If it's an image, then, as
Derek wrote, PyFITS is still pretty inefficient, in that it will
transform the entire image, even when using mmap (which is the default
now, by the way).  I have plans for overhauling this, but it hasn't been
a high priority.  You can also turn off image scaling by passing
do_not_scale_image_data=True when opening the file.  That might
speed things up.
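
A minimal sketch of that option, assuming the data is in an image HDU.
The helper name read_unscaled_row and its fits argument are made up for
illustration; the argument only exists so the same code can be handed
either the standalone pyfits module or astropy.io.fits:

```python
def read_unscaled_row(filename, row, hdu=1, fits=None):
    """Return one row of an image HDU without applying BSCALE/BZERO.

    `fits` may be the standalone `pyfits` module or `astropy.io.fits`.
    """
    if fits is None:
        # Assumption: astropy is installed.  With standalone PyFITS,
        # pass fits=pyfits instead.
        from astropy.io import fits

    hdulist = fits.open(filename, memmap=True,
                        do_not_scale_image_data=True)
    try:
        # Copy the row so it stays valid after the file is closed.
        return hdulist[hdu].data[row].copy()
    finally:
        hdulist.close()
```

Note that with scaling turned off the values come back as stored on
disk, so any BSCALE/BZERO would have to be applied by hand afterwards.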

This is one area where using the .section feature on Image HDUs might 
still be useful.  For example:

result[i] = photo[1].section[x[otherfield] - 1]

In PyFITS 3.1, .section has improved support for scaling just the
requested part of the file, which didn't work well before.  That might
also be faster.

Of course this is all a moot point if this is not a scaled image.  In 
any case, opening a file and reading a single row out of the data should 
not generally take as long as 1 second--especially if they're small files.
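
To see where that second is actually going, it may help to time the
open and the read separately.  This is only a rough sketch using the
standard library; the timed() helper and the labels are made up for
illustration and are not part of PyFITS:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, totals):
    """Add the wall-clock time spent inside the with-block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        totals[label] = totals.get(label, 0.0) + elapsed


# Possible use inside the loop from the original message
# (filename, photo, result, and x are the names used there):
#
#     totals = {}
#     with timed("open", totals):
#         photo = pyfits.open(filename, memmap=True)
#     with timed("read", totals):
#         result[i] = photo[1].data[x[otherfield] - 1]
#     print(totals)
```

If nearly all of the time ends up under "open", the file access itself
is the likely culprit; if it ends up under "read", something is touching
more than the one row.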

Erik