[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

Wed Mar 18 21:58:45 EDT 2009

On Wed, Mar 18, 2009 at 6:00 PM, Sturla Molden <sturla at molden.no> wrote:
>
>> I was thinking about how to best use the contigency-table for tau-b. I
>> don't really see how it can be done without some loops. It may be easier
>> to do this in Fortran.
>
> f2py'ing something like this should work (not thoroughly tested though)...
>
> subroutine taub(C, D, table, tau)
>    implicit none
<snip>

I have a hard time following the double loop and slice indices inside
the double loop. It would be fast and safe on memory since it runs
through the double loop only once, but hard to read, debug and
maintain.

Since I already did most of the work, I finished kendall's tau-a,
tau-b and tau-c for both versions, for original 2 data vectors and for
the contingency table, tau-b and tau-c  are verified with the spss
example for the contingency table

the results: first column from contingency table, second column from 2
data vector version
tau_a 0.190775681342 0.190775681342
tau_b 0.349323424794 (0.34932342479397172, 0.00019199787220495093)
tau_c 0.374485596708 0.374485596708

I didn't use any loop but instead used intermediate arrays. Several
calculations are necessary because I use the contingency table in
flattened format. And I think it's quite readable.  Here's the
contingency table version:

def kendalltau_fromct(x, y, count):
    '''return tau-a, tau-b and tau-c from contingency table data

    example for contingency table
    x = np.array([1,1,1,2,2,2])
    y = np.array([1,2,3,1,2,3])
    count = np.array([10, 5, 2, 9, 12, 16])
    '''

    catx = np.unique(x)
    caty = np.unique(y)
    ncatx = len(catx)
    ncaty = len(caty)

    deltax = np.sign(x[:,np.newaxis] - x)
    deltay = np.sign(y[:,np.newaxis] - y)

    paircount = count[:,np.newaxis]*count - np.diag(count)
    # number of concordant minus number of discordant pairs
    netconc = np.sum((deltax*deltay)*paircount)

    # calculation for tau-a
    tau_a = netconc/(1.*paircount.sum())

    # calculation for tau-c
    m = min(ncatx,ncaty)
    tau_c = netconc /(1.*count.sum()**2)* m/(m-1.0)

    #extra calculation for tau_b
    # row and column counts of contingency table
    countx = np.dot(count,(x[:,np.newaxis]==catx))
    county = np.dot(count,(y[:,np.newaxis]==caty))
    #total number of pairs
    npairs = paircount.sum()
    #number of ties
    ntiex = np.dot(countx,(countx-1))
    ntiey = np.dot(county,(county-1))

    denom = 1.0*np.sqrt((npairs - ntiex ) * (npairs - ntiey))
    tau_b = netconc / denom

    return tau_a, tau_b, tau_c

violence = np.array([1,1,1,2,2,2])
rating = np.array([1,2,3,1,2,3])
count = np.array([10, 5, 2, 9, 12, 16])
vi = np.repeat(violence,count)
ra = np.repeat(rating,count)
tau_a, tau_b, tau_c = kendalltau_fromct(violence, rating, count)
print 'tau_a', tau_a, kendalltaustata(vi,ra)
print 'tau_b', tau_b, stats.kendalltau(vi, ra)
print 'tau_c', tau_c, kendalltauc(vi, ra)

Josef