[MATRIX-SIG] Much ado about nothingness.

Andrew P. Mullhaupt amullhau@ix.netcom.com
Wed, 09 Jul 1997 21:24:42 -0400


At 07:36 PM 7/9/97 -0400, Bradley C. Venner wrote:
>a sympathetic concern.  To do multiple imputation one must specify a
>mechanism for why the data is missing.  The other known data values may
>then be used to make up new values for the missing data.

Nothing like this should be _part_ of the language, since it is not
going to be possible to specify a mechanism that is good in all cases.

What you _can_ put into the language is an efficient way to determine that
there _are_ missing values, and (in many cases), a good way to determine
_which_ values are missing.

Everything else should be done by the user on top of this, and will vary
greatly from application to application.
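
As a rough illustration of that split, here is a minimal Python sketch of
the two queries a language could reasonably supply: "is anything missing?"
and "which values are missing?". It assumes NaN is the missing-value marker,
which is only one possible choice and not something this post prescribes;
the names MISSING, has_missing and missing_mask are hypothetical.

```python
import math

# Assumed marker for a missing value; chosen only for illustration.
MISSING = float("nan")

def has_missing(values):
    """Return True if at least one value is the missing marker."""
    return any(math.isnan(v) for v in values)

def missing_mask(values):
    """Return a list of booleans, True at each position that is missing."""
    return [math.isnan(v) for v in values]

prices = [101.5, MISSING, 99.0, MISSING]
print(has_missing(prices))   # True
print(missing_mask(prices))  # [False, True, False, True]
```

What to do with the flagged positions (drop them, impute them, refuse to
compute) is deliberately left out; that part is the user's decision.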

>In practice, I think an intermediate class with strong NumPy <->
>database bindings is probably necessary for analysis purposes.

Interestingly enough, I do not expect ever to use a "database" in the
classical sense. The reason is that most database models offer very
poor performance in the applications I use - large scale financial
computing.

What I have done on several different platforms is to keep the data
in a form on which my applications operate directly. As a result, I do all
sorts of statistical computations, yet no database is even remotely
important, or even useful. We've always _wanted_ to find a database which
could measure up (I once spent two years evaluating all the remotely viable
choices), but I don't think I can be competitive using a database in the
usual sense.

It may be somewhat surprising, but the majority of companies in my area
do the same thing.

> Such a class, a la data.frame in
>S-Plus, could represent both categorical and numeric data.  Methods for
>this class for converting to NumPy arrays for use in analysis would
>force the user to decide how to treat the missing values.

The data.frame class is a powerful tool, and something like it will be
implemented in Python, probably sooner rather than later. Right now
the hitch is mostly what to do about indexing.

Although I enthusiastically support a data.frame class for Python, I
see no reason why this bears on whether or not a user cares about missing
values. People with missing values will expect the data.frame to treat
missing values consistently with the methods from the parent classes,
and so _those_ methods are where the issue crops up.

> Defaults such
>as dropping rows with missing data could be available.  I think such a
>class would have a number of uses outside of missing data applications,
>particularly for data mining.  The suggestions on defining a new class
>so far are a good place to start.

Just make sure that particular methods of handling missing data are part
of a subclass, _not_ part of the data.frame class itself. You won't have
a single method of handling missing data that is good for most
applications - not even when the exact same missing value is acted on by
the exact same methods, just in a different situation.
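
One way to picture that separation is the Python sketch below. It is only
an illustration under stated assumptions, not a proposed design: the class
names Frame, DropRowsFrame and ColumnMeanFrame are hypothetical, None stands
in for a missing value, and the data is a plain list of rows. The base class
only _reports_ what is missing; each subclass carries one handling policy.

```python
class Frame:
    """Hypothetical table: stores rows and reports which cells are missing."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def missing_mask(self):
        # Representation only: True where a cell is missing (None).
        return [[cell is None for cell in row] for row in self.rows]

    def complete_rows(self):
        # No policy here on purpose; a subclass must choose one.
        raise NotImplementedError("pick a subclass with a missing-value policy")


class DropRowsFrame(Frame):
    """Policy: discard any row that contains a missing cell."""

    def complete_rows(self):
        return [row for row in self.rows if None not in row]


class ColumnMeanFrame(Frame):
    """Policy: replace a missing cell by the mean of its column.

    Assumes at least one row and at least one known value per column.
    """

    def complete_rows(self):
        ncols = len(self.rows[0])
        means = []
        for j in range(ncols):
            known = [row[j] for row in self.rows if row[j] is not None]
            means.append(sum(known) / len(known))
        return [[means[j] if cell is None else cell
                 for j, cell in enumerate(row)]
                for row in self.rows]


data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
print(DropRowsFrame(data).complete_rows())    # [[1.0, 2.0]]
print(ColumnMeanFrame(data).complete_rows())  # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```

The same data gives two different "complete" tables, and neither answer is
wrong in general; which one is acceptable depends on the application.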

What tends to be _easy_ to standardize is a mechanism for representing
missing values. It's definitely not a good idea to try to standardize
the methods of handling missing values; that handling is as application
dependent as anything I can imagine.

Later,
Andrew Mullhaupt

_______________
MATRIX-SIG  - SIG on Matrix Math for Python

send messages to: matrix-sig@python.org
administrivia to: matrix-sig-request@python.org
_______________