[AstroPy] reading one line from many small fits files

Fri Aug 3 17:25:37 EDT 2012

On Fri, Aug 3, 2012 at 1:48 PM, Erik Bray <embray at stsci.edu> wrote:
> A few additional trials gave roughly the same results.  I'm also less
> astonished by the speed differences, simply in that fitsio wraps
> CFITSIO, a C library, while much of PyFITS is pure Python.  Looking at
> the profile, it spends about 2/5 of the time just opening the file and
> creating objects for the Header and HDU structures.  There are some more
> micro-optimizations to be made there, but not much.  PyFITS provides a
> very flexible and extensible object-oriented interface that simply isn't
> possible with CFITSIO, but there's a tradeoff there in terms of raw
> performance, since it's all in pure Python.  For example, in this
> benchmark, PyFITS spends over half a second (cumulatively, under the
> profiler) just on the routine for determining which HDU subclass to
> initialize based on the header keywords--CFITSIO has no equivalent
> routine because it doesn't even care what the HDU type is until you try
> to read some data.  And even then the only real distinction it tries to
> make is, "Is this an image or a table?"

Hi Erik -

Actually I *do* run through all the HDUs and gather all the metadata,
even in John's case of reading a single row and column from a particular
HDU.  It just turns out that CFITSIO provides efficient routines to
access this information, so it doesn't incur significant overhead.  It
may be possible to port the approach those CFITSIO routines take to
Python and use it in PyFITS.
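
As a rough illustration of what "running through all the HDUs" can look like without building full header objects, here is a minimal pure-Python sketch. It is not how CFITSIO or fitsio implement it; the function names (read_header, data_size, scan_hdus) are hypothetical, and it assumes standard 2880-byte FITS blocks, ignores random-groups files, and parses only the simple structural keywords.

```python
BLOCK = 2880  # FITS logical record size in bytes
CARD = 80     # size of one header card in bytes


def read_header(f):
    """Return {keyword: raw value string} for the next header, or None at EOF.

    Values are cut at the first '/', which is good enough for the structural
    keywords used here but not for strings that contain a slash.
    """
    cards = {}
    while True:
        block = f.read(BLOCK)
        if len(block) < BLOCK:
            return None
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            key = card[:8].strip()
            if key == "END":
                return cards
            if card[8:10] == "= ":
                cards[key] = card[10:].split("/")[0].strip()


def data_size(h):
    """Unpadded size in bytes of the data unit described by header h."""
    naxis = int(h.get("NAXIS", 0))
    if naxis == 0:
        return 0
    nelem = 1
    for i in range(1, naxis + 1):
        nelem *= int(h["NAXIS%d" % i])
    bytes_per_value = abs(int(h["BITPIX"])) // 8
    gcount = int(h.get("GCOUNT", 1))
    pcount = int(h.get("PCOUNT", 0))
    return bytes_per_value * gcount * (pcount + nelem)


def scan_hdus(f):
    """List (type, data bytes) for every HDU, seeking past the data units."""
    hdus = []
    while True:
        h = read_header(f)
        if h is None:
            break
        kind = h.get("XTENSION", "'PRIMARY'").strip("' ")
        size = data_size(h)
        hdus.append((kind, size))
        # Data units are padded to a whole number of blocks; skip, don't read.
        f.seek(-(-size // BLOCK) * BLOCK, 1)
    return hdus
```

Calling scan_hdus on a file opened with open(path, "rb") reads only the header blocks and seeks over everything else, which is why gathering this metadata can be cheap.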

> Where PyFITS really takes a big hit performance-wise is in the handling
> of table columns, and, as Perry mentioned, the conversion from the raw
> data to Python data types like bools and strings.  As I wrote earlier in
> this thread, the biggest problem is that PyFITS' design has always been
> optimized for column-based access, and is horribly inefficient for
> row-based access, since the latter usually involves reading entire
> columns into memory anyways.  The reason for this is mostly
> historical--PyFITS' table interface is built on top of Numpy's recarray
> object, which I think is pretty flawed to begin with.  At the time this
> was necessary because PyFITS did not yet support compound dtypes in its
> normal ndarrays.  At least I think that was the issue.  But now it seems
> to be more of a hindrance.

This is a fundamental limitation of structured numpy arrays and the
memmap interface.
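
A minimal sketch of the access pattern in question, assuming a hypothetical three-field table written to a scratch file (the field names, dtype and file name are made up for illustration, and this is neither PyFITS nor fitsio code). Rows are stored contiguously, so a memory-mapped column is a strided view that spans the whole file, whereas a plain seek/read can fetch one field of one row.

```python
import os
import tempfile

import numpy as np

# Hypothetical row layout: fields packed row by row, big-endian as in FITS.
dtype = np.dtype([("id", ">i4"), ("flux", ">f8"), ("name", "S8")])
nrows = 1000

# Write a small scratch table to disk.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
table = np.zeros(nrows, dtype=dtype)
table["id"] = np.arange(nrows)
table["flux"] = np.arange(nrows) * 0.5
table.tofile(path)

mm = np.memmap(path, dtype=dtype, mode="r")

# One row is a single contiguous run of dtype.itemsize bytes.
row = mm[10]

# One column is a strided view: consecutive elements are a full row apart,
# so materializing it walks across every row's region of the file.
col = mm["flux"]
print("row size:", dtype.itemsize, "column strides:", col.strides)

# Reading one field of one row directly touches only that field's bytes.
field_offset = dtype.fields["flux"][1]  # byte offset of "flux" within a row
offset = 10 * dtype.itemsize + field_offset
with open(path, "rb") as f:
    f.seek(offset)
    value = np.frombuffer(f.read(8), dtype=">f8")[0]
print("flux of row 10:", value)
```

The last step is the kind of targeted read that a dedicated C reader can do for arbitrary subsets of rows and columns.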

I wrote some C code to get around this limitation and added it to
numpy.  PyFITS could use it for tables (but not for variable-length
columns).  Unfortunately that project got a bit hijacked by other
developers, who want the ASCII-reading portion of the code to do
type inference.  This has dramatically increased the scope of the
project and delayed things indefinitely.  I am no longer involved, but
it is my understanding that this functionality should be available in
an upcoming version of numpy.

-e

-- 
Erin Scott Sheldon
Brookhaven National Laboratory erin dot sheldon at gmail dot com