[AstroPy] reading one line from many small fits files
Erik Bray
embray at stsci.edu
Tue Jul 31 11:52:07 EDT 2012
On 07/30/2012 08:57 PM, Derek Homeier wrote:
> Hi John,
>
> On 31.07.2012, at 1:40AM, "John K. Parejko"<john.parejko at yale.edu> wrote:
>
>> This is really more of a pyfits question, but I've upgraded to pyfits 3.1 (SVN), which is the version in astropy.
>>
>> I have data stored in thousands of ~few MB .fits files (photoObj files from SDSS) totaling a few TB of data, and I know the one single line I want to extract from some known subset of those files. But pyfits is taking more than a second per file to extract the fields I want, which seems very long, especially if it is using memmapped access, and thus should only have to read that single line (plus the header) from each file.
>>
>> I'm doing something like this:
>>
>> result = np.empty(len(data), dtype=dtype)
>> for i, x in enumerate(data):
>>     filename = getfilename(x[somefield])
>>     photo = pyfits.open(filename, memmap=True)
>>     result[i] = photo[1].data[x[otherfield] - 1]
>>
>> Is there a better way to go about this? Is pyfits known to be quite slow when reading a single row from a lot of different files? Anyone have suggestions on how to speed this up?
>
> that seems quite slow; it takes me about 50 ms to read a random line from the DR8 example file
> with pyfits 3.0.2. Unless the file access itself takes so long something appears to be odd.
> But the only thing coming to my mind now is that pyfits supports scaled column data (similar to
> BSCALE/BZERO in image HDUs, I assume), and if such keywords were present, they would probably
> cause a corresponding transformation for the entire bintable. They don't seem to exist in the standard
> SDSS files, though.
> Naïve question: do you call photo.close() after each read?
>
> Cheers,
> Derek

It probably shouldn't matter whether or not he's calling close(), but
the question about BSCALE/BZERO is possibly relevant. Is the data
you're reading from an image or a table? If it's an image, then, as
Derek wrote, PyFITS is still pretty inefficient, in that it will
transform the entire image even when using mmap (which is the default now,
by the way). I have plans for overhauling this, but it hasn't been a high
priority for the most part. You can also turn off image scaling by
passing do_not_scale_image_data=True when opening the file. That might
speed things up.
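A minimal sketch of that option, assuming an image HDU in extension 1 as in John's loop. The helper name read_raw_row is made up for illustration, and the fallback import of astropy.io.fits is an assumption for setups where the standalone pyfits package is not installed. The values come back unscaled, so BSCALE/BZERO would have to be applied by hand to the one row if physical values are needed:

```python
import numpy as np


def read_raw_row(filename, row, ext=1):
    """Return one row of an image HDU without applying BSCALE/BZERO.

    Hypothetical helper for illustration, not part of PyFITS.
    """
    try:
        import pyfits
    except ImportError:
        # Assumption: astropy's bundled copy is available instead.
        from astropy.io import fits as pyfits

    hdul = pyfits.open(filename, memmap=True, do_not_scale_image_data=True)
    try:
        # Copy the row so it stays valid after the file is closed.
        return np.array(hdul[ext].data[row])
    finally:
        hdul.close()
```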

This is one area where using the .section feature on image HDUs might
still be useful. For example:

    result[i] = photo[1].section[x[otherfield] - 1]

In PyFITS 3.1, .section has improved support for scaling just sections of
the file, which didn't work well before. That might also be faster.
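Wrapped into a function, the .section approach might look like the sketch below. It assumes the data really is an image HDU, since .section is a feature of image HDUs rather than table HDUs. read_section_row is a hypothetical helper, and the astropy.io.fits fallback import is an assumption:

```python
import numpy as np


def read_section_row(filename, row, ext=1):
    """Read a single row of an image HDU through .section.

    Hypothetical helper for illustration. The intent is that only the
    requested part of the image is read and scaled, not the whole array.
    """
    try:
        import pyfits
    except ImportError:
        # Assumption: astropy's bundled copy is available instead.
        from astropy.io import fits as pyfits

    hdul = pyfits.open(filename)
    try:
        # Copy so the result does not depend on the open file.
        return np.array(hdul[ext].section[row])
    finally:
        hdul.close()
```

In the loop from the original message this would be called as result[i] = read_section_row(getfilename(x[somefield]), x[otherfield] - 1), with getfilename, somefield and otherfield being the names from that message.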

Of course, this is all a moot point if this is not a scaled image. In
any case, opening a file and reading a single row out of the data should
not generally take as long as 1 second, especially if they're small files.

Erik