[AstroPy] reading one line from many small fits files

Tom Aldcroft aldcroft at head.cfa.harvard.edu
Tue Jul 31 12:04:47 EDT 2012


On Tue, Jul 31, 2012 at 11:52 AM, Erik Bray <embray at stsci.edu> wrote:
> On 07/30/2012 08:57 PM, Derek Homeier wrote:
>> Hi John,
>>
>> On 31.07.2012, at 1:40AM, "John K. Parejko"<john.parejko at yale.edu>  wrote:
>>
>>> This is really more of a pyfits question, but I've upgraded to pyfits 3.1 (SVN), which is the version in astropy.
>>>
>>> I have data stored in thousands of ~few MB .fits files (photoObj files from SDSS) totaling a few TB of data, and I know the one single line I want to extract from some known subset of those files. But pyfits is taking more than a second per file to extract the fields I want, which seems very long, especially if it is using memmapped access, and thus should only have to read that single line (plus the header) from each file.
>>>
>>> I'm doing something like this:
>>>
>>>     result = np.empty(len(data), dtype=dtype)
>>>     for i, x in enumerate(data):
>>>         filename = getfilename(x[somefield])           # path to the photoObj file
>>>         photo = pyfits.open(filename, memmap=True)     # memory-mapped open
>>>         result[i] = photo[1].data[x[otherfield] - 1]   # pull out the one row of interest
>>>
>>> Is there a better way to go about this? Is pyfits known to be quite slow when reading a single row from a lot of different files? Anyone have suggestions on how to speed this up?
>>
>> that seems quite slow; it takes me about 50 ms to read a random line from the DR8 example file
>> with pyfits 3.0.2.  Unless the file access itself takes that long, something appears to be odd.
>> But the only thing coming to my mind now is that pyfits supports scaled column data (similar to
>> BSCALE/BZERO in image HDUs, I assume), and if such keywords were present, they would probably
>> cause a corresponding transformation for the entire bintable. They don't seem to exist in the standard
>> SDSS files, though.
>> Naïve question: do you call photo.close() after each read?
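>>
>> I.e., something along these lines (just a sketch, reusing the names from your snippet):
>>
>>     photo = pyfits.open(filename, memmap=True)
>>     result[i] = photo[1].data[x[otherfield] - 1]
>>     photo.close()   # release the file handle (and mmap) before the next file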
>>
>> Cheers,
>>                                               Derek
>
> It probably shouldn't matter whether or not he's calling close(), but
> the question about BSCALE/BZERO is possibly relevant.  Is the data
> you're reading from an image or a table?  If it's an image, then as
> Derek wrote, PyFITS is still pretty inefficient, in that it will
> transform the entire image even when using mmap (which is now the
> default, by the way).  I have plans to overhaul this, but it hasn't
> been a high priority.  You can also turn off image scaling if you
> use do_not_scale_image_data=True when opening the file.  That might
> speed things up.
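>
> For example, something like this (a minimal sketch, reusing the names
> from the original snippet):
>
>     photo = pyfits.open(filename, memmap=True,
>                         do_not_scale_image_data=True)
>     result[i] = photo[1].data[x[otherfield] - 1]   # raw, unscaled data read via mmap
>     photo.close()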
>
> This is one area where using the .section feature on Image HDUs might
> still be useful.  For example:
>
> result[i] = photo[1].section[x[otherfield] - 1]
>
> PyFITS 3.1 has improved support for scaling just sections of the
> file, which didn't work well before, so that might also be faster.
>
> Of course this is all a moot point if this is not a scaled image.  In
> any case, opening a file and reading a single row out of the data should
> not generally take as long as 1 second--especially if they're small files.
>
> Erik
>
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
> http://mail.scipy.org/mailman/listinfo/astropy
>

Hi Erik,

Not to hijack the original question, but I have a very similar
question.  When reading a single column from a binary FITS table
(with many other uninteresting columns) using PyFITS, are there any
strategies that will improve performance?  Is using memmap helpful?
By default (without memmap) does PyFITS read the entire table into
memory when a single column is requested?
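
For concreteness, the access pattern I have in mind is something like
this (the file and column names here are just placeholders):

    hdus = pyfits.open('acis_evt2.fits', memmap=True)
    times = hdus[1].data['TIME']    # only this one column is of interest
    hdus.close()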

I could probably figure some of this out experimentally with a few
tests, but as long as we're on the topic you probably just know the
answer.  :-)

Thanks,
Tom

p.s. I think the original question was also referring to a table, not an image.
