[SciPy-Dev] proper way to test distributions

Mon Jun 14 23:26:28 EDT 2010

On Mon, Jun 14, 2010 at 22:07, Vincent Davis <vincent at vincentdavis.net> wrote:
> I was reviewing how the tests of distributions were done in scipy with
> the thought of applying the same methods to numpy.random. I have a lot
> to learn here and appreciate your suggestions.
>
> Link to the scipy test
> http://github.com/pv/scipy-work/blob/master/scipy/stats/tests/test_continuous_basic.py
>
> If I understand correctly, the tests create a sample of 2000 from a
> given distribution and then compare stats (mean, var...) calculated
> with functions from numpy with those stored in the distribution
> instance's .stats. I am not sure how the mean is calculated within the
> distribution (is it just using the scipy mean?). Anyway, this seems a
> little circular.
>
> Maybe I am missing something, but here are my thoughts.
>
> 1) Using seed() and then comparing the actual results (arrays) helps to
> make sure the code is stable but tells you nothing about the quality
> of the distribution.
>
> 2) Using seed() and then calculating the moments (with numpy and
> dist.stats) is not really any different than (1).
>
> 3) Drawing a large sample (possibly using seed()), calculating the
> moments, and comparing them to the theoretical moments seems like the
> best option. But this could be slow.
>
> What is the best way?
> What is desired in numpy?

While it's worthwhile to have both, you really only want (1) in the
standard unit test suite. (3) is good for working out the bugs in the
initial implementation (or retroactively doing so after the grad
student who wrote the initial implementation suddenly ran off and got
a real job. <ahem>). You can provide (3)-type tests if you wish to do
that verification, but they don't need to be in the main test suite. (1)
provides the first layer of protection. If we make an unintentional
change to the results, (1) will catch it. If we make an intentional
change, we can use (3) to verify that our changes are good. But we
don't need to write (3) until we are actually faced with that task.
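
To make the distinction concrete, here is a minimal sketch of the two
kinds of test. It is only an illustration, not code from numpy's or
scipy's test suites: the function names, the seeds, the sample size and
the 5-standard-error tolerance are all assumptions picked for the
example, and the hard-coded values in the first test are what I expect
the legacy RandomState(0) stream to produce.

```python
import numpy as np
from numpy.testing import assert_allclose


def test_seeded_draws_are_stable():
    # Type (1): fix the seed and compare the raw draws against stored
    # values. This catches unintentional changes to the output stream,
    # but says nothing about whether the distribution is correct.
    rng = np.random.RandomState(0)
    draws = rng.standard_normal(3)
    # Stored reference values (assumed: first three standard normal
    # draws from the legacy generator seeded with 0).
    expected = np.array([1.76405235, 0.40015721, 0.97873798])
    assert_allclose(draws, expected, atol=1e-7)


def check_moments(n=200000, loc=2.0, scale=3.0, seed=12345, n_sigma=5.0):
    # Type (3): draw a large sample and compare sample moments against
    # the theoretical ones (mean = loc, variance = scale**2 for a normal).
    rng = np.random.RandomState(seed)
    sample = rng.normal(loc, scale, size=n)
    # Tolerances are n_sigma standard errors of each estimator; the
    # variance term uses the normal-theory approximation.
    mean_tol = n_sigma * scale / np.sqrt(n)
    var_tol = n_sigma * scale**2 * np.sqrt(2.0 / (n - 1))
    mean_ok = abs(sample.mean() - loc) <= mean_tol
    var_ok = abs(sample.var(ddof=1) - scale**2) <= var_tol
    return bool(mean_ok and var_ok)
```

The first function is cheap enough to run on every commit. The second
is slower and statistical in nature, which is why it fits better as a
separate verification script that gets run when an intentional change
is made.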

> And a little off topic but isn't numpy.random duplicating scipy or
> scipy duplicating numpy?

Not really. scipy is using those routines from numpy for most of the
duplicated distributions. numpy needed that functionality to match
Numeric's. Of course, this means that scipy's (3)-type tests should be
providing us coverage for many of numpy's distributions.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco