[SciPy-User] peer review of scientific software

josef.pktd at gmail.com
Wed Jun 5 22:46:00 EDT 2013


On Wed, Jun 5, 2013 at 6:08 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Wed, Jun 5, 2013 at 10:36 PM, Matt Newville
> <newville at cars.uchicago.edu> wrote:
>> The paper that Alan Isaac referred to that started this conversation
>> seemed to advocate for unit testing in the sense of "don't trust the
>> codes you're using, always test them".  At first reading, this seems
>> like good advice. Since unit testing (or, at least, the phrase) is
>> relatively new for software development, it gives the appearance of
>> being new advice.  But the authors damage their case by continuing on
>> by saying not to trust analysis tools built by other scientists based
>> on the reputation and prior use of these tools.  Here, they expose the
>> weakness of favoring "unit tests" over "functional tests".  They are
>> essentially advocating throwing out decades of proven, tested work
>> (and claiming that the use of this work to date is not justified, as
>> it derives from un-due reputation of the authors of prior work) for a
>> fashionable new trend.  Science is deliberately conservative, and
>> telling scientists that unit testing is all the rage among the cool
>> programmers and they should jump on that bandwagon is not likely to
>> gain much traction.
>
> But... have you ever sat down and written tests for a piece of widely
> used academic software? (Not LAPACK, but some random large package
> that's widely used within a field but doesn't have a comprehensive
> test suite of its own.) Everyone I've heard of who's done this
> discovers bugs all over the place. Would you personally trip over them
> if you didn't test the code? Who knows, maybe not. And probably most
> of the rest -- off by one errors here and there, maybe an incorrect
> normalizing constant, etc., -- end up not mattering too much. Or maybe
> they do. How could you even tell?
>
> You should absolutely check scipy.optimize.leastsq before using it!

But leastsq has seen its uses, and we "know" it works.
My main work for scipy.stats has been to make reasonably sure it works
as advertised (sometimes by adding a "don't trust those results" warning).

Optimizers either work or they don't, and we see whether they work for
our problems through "functional" testing, in statsmodels for example.

(notwithstanding that many bugs have been fixed in scipy.optimize
where the optimizers did not work correctly and someone went to see
why.)
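
To be concrete, the kind of functional check I have in mind looks
roughly like this (a toy sketch with made-up data, not anything from an
actual test suite): fit a model with known true parameters and require
that they are recovered to a loose tolerance.

import numpy as np
from scipy.optimize import leastsq

def residuals(params, x, y):
    # exponential decay model: y = a * exp(-b * x)
    a, b = params
    return y - a * np.exp(-b * x)

rng = np.random.RandomState(0)
x = np.linspace(0, 5, 100)
y = 2.5 * np.exp(-1.3 * x) + 0.01 * rng.standard_normal(100)

popt, ier = leastsq(residuals, [1.0, 1.0], args=(x, y))
# "does it work for our problem": true values recovered within 1%
np.testing.assert_allclose(popt, [2.5, 1.3], rtol=1e-2)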

The recent discussion on global optimizers was on how successful they
are for different problems, not whether each individual piece is unit
tested.


> You could rewrite it too if you want, I guess, and if you write a
> thorough test suite it might even work out. But it's pretty bizarre to
> me to think that someone is going to think "ah-hah, writing my own
> code + test suite will be easier than just writing a test suite!" Sure
> some people are going to find ways to procrastinate on the real
> problem (*cough*grad students*cough*) and NIH ain't just a funding
> body. But that's totally orthogonal to whether tests are good.
>
> Honestly I'm not even sure what unit-testing "bandwagon" you're
> talking about. I insist on unit tests for my code because every time I
> fail to write them I regret it sooner or later, and I'd rather it be
> sooner. And because they pay themselves back ridiculously quickly
> because you never have to debug more than 15 lines of code at a time,
> you always know that everything the current 15 lines of code depends
> on is working correctly.
>
> Plus, white-box unit-testing can be comprehensive in a way that
> black-box functional testing just can't be. The code paths in a system
> grow like 2**n; you can reasonably test all of them for a short
> function with n < 5, but not for a whole system with n >> 100. And
> white-box unit-testing is what lets you move quickly when programming,
> because you can quickly isolate errors instead of spending all your
> time tracing through stuff in a debugger. If you want to *know* your
> code is correct, this kind of thorough testing is just a necessary
> (not sufficient!) condition. (Building on libraries that have large
> user bases is also very helpful!)

I think there are two different areas: one is writing library code for
general usage, and the other is code that is written for (initially)
one-time usage as part of the research.

Library code is reasonably tested, either by usage or by unit/functional tests.
If it passes functional tests and usage, it should be reasonably
"safe". Moreover, competition among packages (in
statistics/econometrics, for example) creates a large incentive for
software developers to make sure the code is correct and to respond to
any reports of imprecision.
For example, for nonlinear least squares, optimize.leastsq and the
packages that use it have the NIST test cases: either we pass them or we don't.
Another example: every once in a while a journal article is published
on the quality of an estimation procedure in the most popular statistical
software packages. If one commercial package gets a bad report, then
it is usually quickly fixed, or the vendor shows that the default values
the author of the paper used were not appropriate. (The last example I
remember is GARCH fitting, where most packages didn't do well because
of numerical-derivative problems. The author who criticized this also
showed how to do it better, which was then quickly adopted by all
major packages.)
If users cannot trust a package, then there is a big incentive for
them to switch to a different one.
But if everyone else in a field is using one package, then an
individual researcher cannot be "blamed" for also using it, and has
little incentive to unit test.

How do you know what the correct result for a code unit is, so that you
can write a unit test for it?
Often I don't know, and all I can do is wait for the functional test results.

Minor bugs are indistinguishable from minor design decisions, and it's
not worth a huge amount of effort to find them all:

example: in time series analysis, an indexing mistake at the beginning
of the time series changes some decimals, and the mistake hides behind
other decisions about how initial conditions are treated across methods
and packages. An indexing mistake at the end of the time series screws
up the forecasting and will be visible the first time someone looks at
the forecasts.
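
(A toy illustration, not taken from any package: with an AR(1) forecast,
picking the wrong last observation is immediately visible, while a shift
at the start only perturbs the initial conditions.)

import numpy as np

rng = np.random.RandomState(0)
nobs, phi = 200, 0.8
y = np.zeros(nobs)
for t in range(1, nobs):
    y[t] = phi * y[t - 1] + rng.standard_normal()

fc_correct = phi * y[-1]      # one-step-ahead forecast from the last observation
fc_off_by_one = phi * y[-2]   # the same forecast built from the wrong index
print(fc_correct, fc_off_by_one)   # clearly different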

example: degrees of freedom and small-sample corrections. I have no
idea what different packages use until I specifically read the docs
for this (if there are any, which is the case for Stata and SAS, but
not for R), and test and verify against them. If I don't have exactly
the same algorithm, then I don't see minor bugs/design choices, because
they don't make much difference in most Monte Carlos and applications.
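
(The numpy variance estimator shows the flavor of such a choice: divide
by n or by n - 1. Both are defensible, and the difference only moves
the decimals.)

import numpy as np

x = np.random.RandomState(0).standard_normal(30)
print(x.var(ddof=0))   # maximum-likelihood style, divide by n
print(x.var(ddof=1))   # small-sample corrected, divide by n - 1
# the two differ by a factor n / (n - 1), about 3% for n = 30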

example: Is the low precision of the Jacobian of
scipy.optimize.leastsq a feature or a bug? I don't use it.

example: Is the limited precision of scipy.special in some parameter
ranges a feature or a bug? It's a feature for me, but Pauli and some
other contributors treat these cases as bugs if they can do something
about them, and they have removed many of them.

example: numerical problems and low precision in scipy.stats.distributions
for unusual cases: bug or feature?
I can work with 6 to 10 decimals of precision in most cases, but
sometimes I or some users would like a lot more, or want to
evaluate the distributions at some "weird" parameters.
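
(A generic illustration of the precision question, not a specific bug:
far in the tail, the naive complement of the cdf loses everything, while
the dedicated sf method keeps full accuracy.)

from scipy import stats

x = 10.0
print(1 - stats.norm.cdf(x))   # 0.0 -- all precision lost in the upper tail
print(stats.norm.sf(x))        # about 7.6e-24, the accurate tail probability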

example: I implemented 11 methods to correct p-values for multiple
testing; 9 are verified against R, 2 are not available in R, and for
those I have to trust my code and the fact that they do well in the
Monte Carlo (although slightly worse than I would have expected).
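
(For the methods that do have a simple closed form, the unit test is
easy to state. A sketch, assuming the multipletests interface in
statsmodels.stats.multitest: Bonferroni can be checked by hand.)

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.01, 0.04, 0.2])
reject, p_corr, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
# Bonferroni is just p * k, capped at 1
np.testing.assert_allclose(p_corr, np.minimum(pvals * len(pvals), 1))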


The other part: code written for one-time research.
Why would you spend a lot of time unit testing, if all you are
interested in is the functional test that it works for the given
application?
And, as above, how would you know what the correct result should be,
besides "works for me"?

bottom line:
I think unit testing and functional testing for scientific code are
pretty different from those in many other areas of software development.
It's easy to write a unit test that checks that the right record is
retrieved from a database.
It's a lot more difficult to write a unit test that checks that I coded
the asymptotic distribution for a new estimator or test statistic
correctly.
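
(Often the closest I can get is a Monte Carlo check that the nominal
properties hold. A minimal sketch, with scipy's one-sample t-test
standing in for the "new" test statistic: under the null, the rejection
rate at the 5% level should be close to 0.05.)

import numpy as np
from scipy import stats

rng = np.random.RandomState(12345)
nrep, nobs = 5000, 50
rejections = 0
for _ in range(nrep):
    x = rng.standard_normal(nobs)
    t, p = stats.ttest_1samp(x, 0.0)
    rejections += p < 0.05
print(rejections / float(nrep))   # should be close to the nominal 0.05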


(How did I end up on the wrong side of the argument?
I have been advocating TDD, unit tests and verified functional tests
for five years on this mailing list.)

Josef

>
> -n


