[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

josef.pktd at gmail.com josef.pktd at gmail.com
Fri Feb 27 14:48:30 EST 2009


On Fri, Feb 27, 2009 at 1:54 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
> All,
> I followed the thread without actively participating, but as the
> author of stats.mstats, I feel compelled to jump in.
>
> When I started working on some masked versions of scipy.stats version,
> numpy.ma wasn't part of numpy per se (if I remember correctly), and
> the package hadn't been thouroughly checked. Modifying scipy.stats to
> recognize masked arrays (for example, by changing _chk_array) wasn't
> really an option at the time, because numpy.ma was still considered as
> experimental. The easiest was therefore just to duplicate the
> functions. I checked that the results were consistent with the non-
> masked versions at the time, checked against R also, so I was fairly
> confident in the results. However, I didn't strive for exhaustivity: I
> coded the functions I needed and some of them direct relatives, but
> never tried to expand some more complex functions.
>
> I'm all in favor for merging the masked and non-masked versions:
> that's cleaner, easier to debug and maintain should there be some
> changes in signature (or even just doc). There's a few aspects we must
> keep in mind however:
>
> * standard numpy functions usually work well with masked arrays: if
> the input is MA, the np function should call MA.__array_wrap__ which
> will transform the result back to a MA. I have the nagging feeling
> it's not completely fool-proof, however, but the numpy.ma functions
> should always work. if the input is not a MA, then the output will
> never be masked, which may be a problem.
>
> Consider this example:
>  >>> x=np.array([0,1,2])
>  >>> np.log(x).mean()
> -inf
>  >>> ma.log(x).mean()
>  0.34657359027997264
>  >>> np.log(ma.array(x)).mean()
>  0.34657359027997264
>
> If we don't transform x to a MA, or don't use the numpy.ma function,
> we just get a NaN/Inf as results. Otherwise, we get a nice float.
>
> * Systematically using MA may be problematic: I can picture cases
> where a standard ndarray is expected as output when a standard ndarray
> is given as inputt. If we force the conversion, the result will be a
> MA. Should we convert it back to a ndarray ? Using .filled() ? But
> then, with which filling_value ?
>
> In that case, we may want to consider a "usemask" flag: if
> usemask=True and the input is a ndarray, then the output will be a MA,
> otherwise it'll be a ndarray. Using a MA as input would set usemask to
> True no matter what.
>
> * The correlation functions discard masked values pair-wise: we can
> pre-process the inputs and still use the standard functions, so no
> problem here.
>
> * Some functions (threshold) can work directly w/ MA.
>
> * Some functions (the ones based on ranking) should behave differently
> whether the input has masked values (as missing values must be taken
> as ties).
>
>
> About optimization and speed test:
> * There's definitely some room for improvement here: for example,
> instead of using the count method, we could use the count function to
> prevent any unnecessary conversion to MA.  (I'd need to optimize the
> count function, but that should be easy...). That'll depend on what we
> decide for handling MA.
> * Just running tests w/ the masked versions of the function will
> always show that they are slower, of course.
> *  Slight differences of the order of 1e-15 should not really matter.
>
> All, don't hesitate to contact me on or off-list if you have some
> specific questions about implementation details.
>

I still need to look at several examples, before I get a better feeling of
how this will work.

I just looked at the implementation of ma.var, ma.cov, ma.exp and a few more,
and I think they are very well written and I don't see any way how their
performance could be improved.
Given that gmean is a very simple function, I was pretty surprised about
the difference in timing. Now, I think that the main slowdown is that the
mask has to be checked in every operation that calls a ma.* version
of a function.

As we discussed for the OLS case for larger statistical functions, building
the main workload with plain arrays will save a lot of overhead. This works
for cases where a single compression or fill is correct for all required
numerical operations.

If we get the correct setup (interface, conversion) for two kinds of functions,
any array to ma core of the function", and "any array to plain core", then
it will be easier, at least for me, to follow this pattern when
(re)writing functions.

One more issue is the treatment of nan and masked values, for example, if a
function produces nans because of a zero division, then I would want to treat
it differently than a missing value in the data. If it is
automatically included in the
mask then this distinction is lost. Or is there a different use case for this?

In your log example, I wouldn't want to get a nice number back. I want
the function
to complain. Silently changing the definition of mathematical
operations creates a
huge potential for errors (that's why I also don't like the silent
conversions when casting to int)
For example, if this is maximum likelihood estimation, the log
likelihood is -inf
and not some nice number.
>>> x=np.array([0,1,2])
>>> np.log(x).mean()
I think if users want nice numbers, then they should mask them in the
first place.
Actually, I didn't realize this before, that ma adds additional points
to the mask.

But, before we start to rewrite and refactor across the board, I still
want to finish
cleaning up the existing functions and resolve some of the current
inconsistencies.

Josef



More information about the SciPy-Dev mailing list