[AstroPy] PyFITS and mmap

Tom Aldcroft aldcroft at head.cfa.harvard.edu
Fri Sep 23 11:08:16 EDT 2011


On Fri, Sep 23, 2011 at 8:37 AM, Paul Barrett <pebarrett at gmail.com> wrote:
> Erik,
>
> The performance impact can be greater than you might think.  As an
> example, I have some Python code that uses subprocesses to divide the
> processing among eight or more processors.  The data is shared between
> the parent and child processes using memory-mapping.  The calculations
> take about 5 minutes per subprocess and then another 7 minutes or so
> to write the data to disk before the subprocess ends.  I would
> therefore prefer that memory-mapped files be an option instead of the
> default to avoid such a possible performance hit. If it is the
> default, there may be situations where the performance is poor and the
> novice user would not know why PyFITS is performing so poorly.  This
> adverse behavior may discourage users from using FITS files and push
> them toward HDF5 files (i.e., the tables package) instead, which, when
> I think about it, would be a good thing.

I'm not sure many novice users will be knowingly creating subprocesses
in their Python scripts.  I would say the case of a novice user
deciding to open a 20 GB FITS file (and complaining about performance)
is more likely.  But I agree that you need to be pretty careful about
making a default change like this and consider (and test) a wide
variety of use cases.  +1 on HDF5 for big datasets.
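
For anyone following along, here is a rough sketch of what memory-mapped
access buys you on a big file. It uses only the standard-library mmap
module, not PyFITS itself, and the helper name read_window and the dummy
file example.dat are made up for illustration:

```python
import mmap
import os
import tempfile


def read_window(path, offset, length):
    """Return `length` bytes starting at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        # A length of 0 maps the entire file; the OS only pages in
        # the parts that are actually touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Stand-in for a large FITS file: 100 blocks of 2880 bytes (the FITS
# block size), each filled with its own block index as dummy data.
BLOCK = 2880
path = os.path.join(tempfile.mkdtemp(), "example.dat")
with open(path, "wb") as f:
    for i in range(100):
        f.write(bytes([i]) * BLOCK)

# Pull 16 bytes out of block 42 without loading the other 99 blocks.
chunk = read_window(path, 42 * BLOCK, 16)
```

In PyFITS the corresponding switch is the memmap=True option to
pyfits.open() that Erik mentions below; the sketch only illustrates the
underlying mechanism and says nothing about how PyFITS implements it.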

- Tom A

> On Thu, Sep 22, 2011 at 12:21 PM, Erik Bray <embray at stsci.edu> wrote:
>> Hi all,
>>
>> Every now and then PyFITS gets support requests from people trying to
>> work with very large FITS files (>4 GB; I've seen as high as 50 GB) and
>> having trouble when they run out of memory.
>>
>> Normally I point them to the memmap=True option to pyfits.open(), and
>> that works for them.  On 64-bit systems in particular there's more than
>> enough virtual address space to mmap very large files.
>>
>> And I got to thinking that while most FITS files I encounter are not
>> many gigabytes in size, they are still over 100 MB.  And there are only
>> so many operations that actually require having an entire array in
>> memory at once.  So maybe it would make sense to have PyFITS use mmap by
>> default.
>>
>> There could be some slight performance implications here: For example,
>> when reading the data a little bit at a time, mmap is a little bit
>> slower, unsurprisingly.  But in practice I don't think it's a very
>> noticeable difference, and the benefits--far less memory usage and more
>> transparent support for large files--outweigh any drawbacks I can think of.
>>
>> I'm just putting this out there because I wonder if there are any other
>> downsides to this that I'm not thinking of.
>>
>> Thanks,
>> Erik
>> _______________________________________________
>> AstroPy mailing list
>> AstroPy at scipy.org
>> http://mail.scipy.org/mailman/listinfo/astropy
>>
