[SciPy-User] peer review of scientific software

Matt Newville newville at cars.uchicago.edu
Wed Jun 5 17:36:54 EDT 2013


Hi,

On Wed, Jun 5, 2013 at 4:05 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi,
>
> On Tue, Jun 4, 2013 at 4:27 AM,  <josef.pktd at gmail.com> wrote:
>> On Tue, Jun 4, 2013 at 4:07 AM, Suzen, Mehmet <msuzen at gmail.com> wrote:
>>> On 28 May 2013 20:23, Calvin Morrison <mutantturkey at gmail.com> wrote:
>>>>
>>>> http://arxiv.org/pdf/1210.0530v3.pdf
>>>>
>>>> Pissed-off Scientific Programmer,
>>>> Calvin Morrison
>>>
>>> Those recent papers and discussions all talk about good practices.
>>> I was thinking today on the bus about why there is not much
>>> literature on scientific software development methodologies. One
>>> explicit paper I found was from the 80s, called
>>>
>>> A Development Methodology for Scientific Software
>>> Cort, G., et al.
>>> http://dx.doi.org/10.1109/TNS.1985.4333629
>>>
>>> It is a pretty classic approach by today's standards. There is also
>>> a book about generic style and good practice; it's a pretty good
>>> book (it might have been mentioned on this list before):
>>>
>>> Writing Scientific Software: A Guide to Good Style
>>> Suely Oliveira and David E. Stewart
>>> http://www.cambridge.org/9780521858960
>>>
>>> but I don't see any reference to modern development methodologies
>>> specifically addressed to scientific software. For example,
>>> extensions of test-driven development, which would suit scientific
>>> software better than the classic specification-design-coding-testing
>>> cycle. Test cases would be directly related to what we would like to
>>> achieve in the first place, for example a generic density of
>>> something, etc. I haven't heard of anyone developing scientific
>>> software this way... yet.
>>
>> I think functional (not unit) testing is pretty much the standard in
>> the area of developing statistical algorithms, even if nobody calls
>> it that. And I don't know of any references on software development
>> for it.
>>
>> When writing a library function for existing algorithms, it is
>> standard to test it against existing results. Many (or most)
>> software packages, or the articles that describe the software, show
>> that they reproduce existing results as test cases.
>> (And that's the way we work on statsmodels.)
>>
>> For new algorithms, it is standard to publish Monte Carlo studies that
>> show that the new algorithm is "better" in at least some cases or
>> directions than the existing algorithms (or statistical estimators and
>> tests), and often they use published case studies or applied results
>> to show how the conclusion would differ or be unchanged.
>>
>> (Just for illustration, the workflow of some friends of mine who are
>> theoretical econometricians: first write the paper with the heavy
>> theory and proofs, then start to write the Monte Carlo. The first
>> version doesn't deliver the results that would be expected based on
>> the theory, so look for bugs and fix those, rerun the Monte Carlo,
>> iterate. Then find different test cases and simulated data-generating
>> processes, show where it works and where it doesn't, and check the
>> theoretical explanation/intuition for why it doesn't work in some
>> cases. Submit only the cases that work, and write a footnote for the
>> others.)
>
> Here is an example of some incorrect theory combined with a simulation
> showing correct results.  It turned out there were two separate errors
> in theory which balanced each other out in the particular case used
> for the simulation.
>
> This paper reviews and corrects the previous paper:
>
> http://www.math.mcgill.ca/keith/fmriagain/fmriagain.abstract.html
>
> Quote from section 2.2:
>
> "In general the variance of the parameter estimates is underestimated
> by equation (3) but the
> estimator of the variance is overestimated by equation (6), so that
> the two tend to cancel
> each other out in the T statistic (5). It can be shown that they do
> cancel out almost exactly
> for the random regressors that were chosen for validating the methods,
> which explains why
> the biases were not observed. However for other non-random regressors
> these effects do not
> cancel and large discrepancies can occur."
>
> I think that points to the need to write tests for all the parts, not just the whole.
>
> Cheers,
>
> Matthew

This has been a fairly wide-ranging conversation, but I want to say
that I agree completely with Josef's views here.

Functional testing has been the norm for scientific software, and, I
would say, justifiably so.  It is how scientists are trained to think
and work.  Yes, you must question and understand the details of the
instruments and algorithms you use and keep them well calibrated
(which is approximately the meaning of the currently fashionable "unit
testing").  But at some point, you trust that your instruments are
calibrated and working and that the algorithms you're using are mostly
bug-free, and you build on these to make real measurements and do real
analyses on new, untested systems.
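
To make that distinction concrete, here is a minimal, purely
illustrative sketch (my own toy pipeline, with made-up names like
peak_area -- nothing from any of the papers discussed here).  The
"unit" test calibrates one component in isolation; the "functional"
test only checks that the end result matches a known analytic answer.

    import numpy as np

    def estimate_background(y, n_edge=10):
        # Estimate a constant background from the edges of a signal.
        return 0.5 * (y[:n_edge].mean() + y[-n_edge:].mean())

    def peak_area(x, y):
        # Integrate a background-subtracted peak (rectangle rule).
        dx = x[1] - x[0]
        return np.sum(y - estimate_background(y)) * dx

    def test_background_unit():
        # "Unit" style: push one button, check one component.
        y = np.full(100, 2.0)
        assert abs(estimate_background(y) - 2.0) < 1e-12

    def test_peak_area_functional():
        # "Functional" style: a Gaussian of known amplitude and width
        # on a flat background; only the final area is compared with
        # the analytic answer, amp * sigma * sqrt(2*pi).
        x = np.linspace(-5, 5, 2001)
        y = 2.0 + 3.0 * np.exp(-x**2 / (2 * 0.5**2))
        expected = 3.0 * 0.5 * np.sqrt(2 * np.pi)
        assert abs(peak_area(x, y) - expected) < 1e-3

Both kinds of test are useful; they just answer different questions.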

The paper that Alan Isaac referred to, which started this conversation,
seemed to advocate unit testing in the sense of "don't trust the
codes you're using; always test them".  At first reading, this seems
like good advice.  Since unit testing (or, at least, the phrase) is
relatively new in software development, it gives the appearance of
being new advice.  But the authors damage their case by going on to
say not to trust analysis tools built by other scientists based on
the reputation and prior use of those tools.  Here, they expose the
weakness of favoring "unit tests" over "functional tests".  They are
essentially advocating throwing out decades of proven, tested work
(and claiming that the use of this work to date is not justified, as
it derives from the undue reputation of the authors of that prior
work) for a fashionable new trend.  Science is deliberately
conservative, and telling scientists that unit testing is all the rage
among the cool programmers and that they should jump on the bandwagon
is not likely to gain much traction.

To be clear, and to use an example I am familiar with, these authors
imply "don't trust scipy.optimize.leastsq() -- test it and be
skeptical before using it".  The problem is that this is easily read
as "if you write your own minimization code and write unit tests,
you're doing better than this older, outdated work.  That piece of
junk was used by others based solely on the reputation of its
authors; they didn't have any unit tests at all!".
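
For what it's worth, here is roughly what exercising leastsq()
functionally looks like -- a hedged sketch of my own, with a made-up
Gaussian model and made-up numbers, not anyone's published test:
trust MINPACK, simulate data with known parameters, and check that
the fit recovers them.

    import numpy as np
    from scipy.optimize import leastsq

    def residual(params, x, data):
        # Gaussian model; leastsq minimizes the summed squared residual.
        amp, cen, sigma = params
        return data - amp * np.exp(-(x - cen)**2 / (2 * sigma**2))

    rng = np.random.default_rng(0)
    x = np.linspace(-5, 5, 501)
    true = np.array([7.0, 0.5, 1.2])
    data = true[0] * np.exp(-(x - true[1])**2 / (2 * true[2]**2))
    data = data + rng.normal(scale=0.05, size=x.size)

    best, ier = leastsq(residual, x0=(1.0, 0.0, 1.0), args=(x, data))
    assert ier in (1, 2, 3, 4)      # MINPACK convergence codes
    assert np.allclose(best, true, atol=0.05)
    print("recovered parameters:", best)

A check like this is cheap, and it exercises the wrapper and your own
model function at the same time as the underlying algorithm.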

One of the key features of scipy is that it reuses well-tested work
(LAPACK, MINPACK-1, FFTPACK, and similar proven libraries).  Now,
there might be a bug in these (and there might be a bug in the scipy
wrapper), but the likelihood of finding a new bug with any particular
use case is vanishingly small.  No one is saying MINPACK-1 (or
scipy.optimize.leastsq, or the Standard Model) is perfect and
complete.  But it works well on one heck of a lot of cases.  At some
point you *must* rely on these to make progress.  In fact, doing so
(applying existing models to new problems and using the results, that
is, functional testing) is the classic way in which flaws in the
underlying models are found.

Yes, calibrating the wazoo out of instruments and algorithms, and
pushing every button independently (unit testing) is very useful.  I
don't think anyone is advocating against doing this.  But doing this
to the exclusion of existing methodologies is going to meet resistance
among scientists, and for good reason.

The problem with the Reinhart and Rogoff paper (and the success of
the re-analysis by Herndon, Ash, and Pollin) is a good example of how
science works, and of what to avoid.  And the thing to avoid is *not*
(as some seem to have suggested) Excel or spreadsheets in favor of
procedural programming approaches, but rather home-built,
poorly-tested, and poorly-described algorithms.  Yes, if they had
used R or scipy they might have been better off.  Unit testing would
have helped them.  Any testing would have helped them, as would a
better explanation of their methods.

I'm sorry to admit that I read only the abstract, but I would not be
surprised if Matthew Brett's example also fell into this category.
That is, were the nearly-canceling mistakes discovered because of
unit testing or because of tests of the whole?  Obviously, if two
functions were always (always!) used together and had canceling
errors (say, one function incorrectly scaled by a factor of 2 and
the other incorrectly scaled by a factor of 1/2), unit testing might
show flaws that never, ever changed the end results.
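
A contrived sketch of that hypothetical (my own toy code, not
Matthew's actual example): each step below is intended to be a
pass-through, one erroneously multiplies by 2 and the other by 1/2,
so a test of the whole passes while unit tests of the parts would not.

    import numpy as np

    def forward(y):
        # Intended: return y unchanged.  BUG: scales by 2.
        return 2.0 * np.asarray(y)

    def inverse(z):
        # Intended: return z unchanged.  BUG: scales by 1/2.
        return 0.5 * np.asarray(z)

    y = np.array([1.0, 2.0, 3.0])

    # Functional test of the pair: passes, because the errors cancel.
    assert np.allclose(inverse(forward(y)), y)

    # Unit tests of each part alone would fail and expose the bugs:
    #   np.allclose(forward(y), y)  -> False (off by a factor of 2)
    #   np.allclose(inverse(y), y)  -> False (off by a factor of 1/2)

Whether those individual flaws ever matter depends on whether the
parts are ever used separately -- which the functional test alone
cannot tell you.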

Functional testing (applying a set of analysis tools to a wide range
of data, as with a Monte Carlo approach) seems completely sensible to
me.  You are not claiming that every component is independently
checked and proven reliable on its own, but you are testing the whole.
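
And one last sketch, in the spirit of the Monte Carlo workflow Josef
described (hypothetical model and numbers, not from any real study):
simulate many datasets from a known data-generating process and check
that the estimator behaves as theory predicts, without ever
unit-testing its internals.

    import numpy as np

    rng = np.random.default_rng(42)
    true_slope, true_intercept = 2.0, 1.0
    n_obs, n_trials = 50, 2000

    slopes = np.empty(n_trials)
    for i in range(n_trials):
        x = rng.uniform(0, 10, size=n_obs)
        noise = rng.normal(scale=1.0, size=n_obs)
        y = true_intercept + true_slope * x + noise
        slopes[i] = np.polyfit(x, y, 1)[0]   # estimator under test

    # The fitted slope should be (nearly) unbiased for this model.
    assert abs(slopes.mean() - true_slope) < 0.01
    print("mean slope:", slopes.mean(), " spread:", slopes.std())

If the whole pipeline were biased, a study like this would show it,
even though I never looked inside np.polyfit.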

Sorry this was so long,

--Matt


