[SciPy-dev] Numerical Recipes (was tagging 0.7rc1 this weekend?)

josef.pktd at gmail.com
Sat Dec 20 00:12:57 EST 2008


On Fri, Dec 19, 2008 at 8:49 PM, Bill Baxter <wbaxter at gmail.com> wrote:
> On Sat, Dec 20, 2008 at 10:18 AM, Alan G Isaac <aisaac at american.edu> wrote:
>> Quoting from http://www.iusmentis.com/copyright/software/protection/
>>
>>         An implementation of a standard algorithm may
>>         require very little creativity and if so, is likely
>>         not protected by copyright. The function of a work
>>         can only be protected by a patent.
>>
>>         Even when providing a largely functional
>>         implementation, creativity can still be found in
>>         things like function and variable names, code
>>         optimizations and even the comments accompanying the
>>         code. Copying such an implementation wholesale then
>>         still infringes on copyright.
>>
>
> The problem, though, is that terms like "requires little creativity"
> are subjective.  You may think the code required no creativity, but
> the publisher of NR may disagree.  And then it takes a court to decide
> who is right.   And at that point even if SciPy is vindicated, the
> SciPy community still loses because of all the lost time and money that
> has to go into such a fight.  It's easier to just make sure you steer
> well clear of any potential infringement from the beginning.
>
> IANAL, but I think that especially once the issue is raised in a
> public forum like this -- now you've got a record that the developers
> were aware of a potential problem.  If nothing is done about it, it
> might be regarded as "willful infringement".  But if an honest attempt
> is made to replace the code, then you've at least got a public record
> that you were attempting to do your "due diligence" to eliminate
> infringements.
>
> So, silly as it may seem, I think that Jarrod's plan of action here is
> correct.   The NR guys may be wrong about copyright law, but being
> right is not a prerequisite for filing a lawsuit.  Merely thinking you
> are is generally enough.  And sometimes not even that...
>
> --bb


In these cases, I don't see any copyrightable creativity. I replaced
the use of betai with the t-distribution on the recommendation of
Bruce Southey. The formula for `ttest_ind` is exactly the same as the
one on Wikipedia, and I guess in any number of textbooks:
http://en.wikipedia.org/wiki/T-test#Independent_two-sample_t-test
(section: Unequal sample sizes, equal variance).
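
For reference, that formula (as transcribed from the Wikipedia page, not
from NR) is

    t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}},
    \qquad
    s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2},
    \qquad \mathrm{df} = n_1 + n_2 - 2,

which is exactly the pooled-variance computation (svar and df) in the
code below.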

Variable names are not very creative, and the shape and axis handling
is standard numpy style. The handling of the zero-division problem
seems to have been added later and is not in the current version of
Gary Strangman's code.

So I don't see what can be changed here, except for cosmetic changes.

def ttest_ind(a, b, axis=0):
    a, b, axis = _chk2_asarray(a, b, axis)
    x1 = mean(a,axis)
    x2 = mean(b,axis)
    v1 = var(a,axis)
    v2 = var(b,axis)
    n1 = a.shape[axis]
    n2 = b.shape[axis]
    df = n1+n2-2
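    # pooled (within-group) variance, using df = n1 + n2 - 2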
    svar = ((n1-1)*v1+(n2-1)*v2) / float(df)
    zerodivproblem = svar == 0
    t = (x1-x2)/np.sqrt(svar*(1.0/n1 + 1.0/n2))  # N-D COMPUTATION HERE!!!!!!
    t = np.where(zerodivproblem, 1.0, t)         # replace NaN t-values with 1.0
    #probs = betai(0.5*df,0.5,float(df)/(df+t*t))
    probs = distributions.t.sf(np.abs(t),df)*2

    if not np.isscalar(t):
        probs = probs.reshape(t.shape)
    if not np.isscalar(probs) and len(probs) == 1:
        probs = probs[0]
    return t, probs
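
A minimal usage sketch with made-up data, assuming the function is
exposed as scipy.stats.ttest_ind (as it is now); the module-level names
mean, var, distributions and _chk2_asarray used above are the ones
already defined in scipy.stats:

import numpy as np
from scipy import stats

np.random.seed(12345)
a = np.random.normal(loc=0.0, scale=1.0, size=50)   # sample 1
b = np.random.normal(loc=0.5, scale=1.0, size=60)   # sample 2, shifted mean

t, prob = stats.ttest_ind(a, b)     # same signature as the function above
print(t, prob)                      # two-sided p-value from the t distribution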

The second test looks a little more complicated in the current
version; for example, it defines several variables that are never used.
I am currently rewriting the main formula to follow exactly the
Wikipedia formula, which is much easier to read.
For a Monte Carlo with 10000 replications and sample size 500, I get a
maximum difference between the new version and the old version of
3.9968028886505635e-015, which should be close enough. Rejection rates
for alpha = 1%, 5% and 10% are pretty close, so the test is correctly
sized.
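
The comparison was along these lines (only a sketch, not the exact
script; old_ttest_rel is a hypothetical stand-in for the previous
implementation):

import numpy as np
from scipy import stats

np.random.seed(0)
n_rep, n_obs = 10000, 500
alphas = np.array([0.01, 0.05, 0.10])
reject = np.zeros(3)

for _ in range(n_rep):
    a = np.random.randn(n_obs)
    b = a + np.random.randn(n_obs)   # paired samples, H0 (equal means) is true
    t_new, p_new = stats.ttest_rel(a, b)
    # abs(t_new - old_ttest_rel(a, b)[0]) is tracked for the maximum difference
    reject += (p_new < alphas)

print(reject / n_rep)   # empirical rejection rates, should be close to alphas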

This is the new code:

def ttest_rel(a, b, axis=None):
    a, b, axis = _chk2_asarray(a, b, axis)
    if len(a) != len(b):
        raise ValueError('unequal length arrays')
    n = a.shape[axis]
    df = float(n-1)
    d = (a-b).astype('d')

    # denom is the standard error of mean(d): sqrt(var(d, ddof=1)/n),
    # i.e. the sum of squared deviations divided by n*(n-1), then square-rooted
    denom = np.sqrt(np.var(d, axis, ddof=1)/float(n))
    zerodivproblem = denom == 0
    t = np.mean(d, axis) / denom
    t = np.where(zerodivproblem, 1.0, t)    # replace NaN t-values with 1.0
    probs = distributions.t.sf(np.abs(t),df)*2
    if not np.isscalar(t):
        probs = np.reshape(probs, t.shape)
    if not np.isscalar(probs) and len(probs) == 1:
        probs = probs[0]
    return t, probs
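
For reference, the formula from the same Wikipedia page for the
dependent (paired) test, which the code above follows, is

    t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \mathrm{df} = n - 1,

where d are the paired differences and s_d is their sample standard
deviation (ddof=1), i.e. exactly mean(d) / sqrt(var(d, ddof=1)/n).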


As an aside, is there any difference between np.sum() and
np.add.reduce(), or are they equivalent or exactly the same?
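
For concreteness, the two calls being compared:

import numpy as np

x = np.arange(12.).reshape(3, 4)
print(np.sum(x, axis=0))
print(np.add.reduce(x, axis=0))   # same values here; whether they differ in general is the question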

Josef


