[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

Thu Mar 19 13:03:16 EDT 2009

On Thu, Mar 19, 2009 at 9:30 AM, Sturla Molden <sturla at molden.no> wrote:
>
> So, it seems this version is faster and more memory efficient:
>
> def kendalltau_fromct(table):
>     '''return tau-a, tau-b and tau-c from contingency table data
>
>     example for contingency table:
>
>     count = np.array([[10,  5,  2],
>                        [ 9, 12, 16]])
>
>     '''
>
>     def _table(j,i):
>       return {
>          #'above' :       table[:j,i],
>          'below' :       table[j+1:,i],
>          #'left' :        table[j,:i],
>          'right' :       table[j,i+1:],
>          #'upper-left' :  table[:j,:i],
>          #'upper-right' : table[:j,i+1:],
>          'lower-left' :  table[j+1:,:i],
>          'lower-right' : table[j+1:,i+1:] }
>
>
>     D = table.shape[0]
>     C = table.shape[1]
>     P = 0
>     Q = 0
>     Tx = 0
>     Ty = 0
>     for i in range(C):
>         for j in range(D):
>
>             pivot = table[j,i] # use this as pivot
>             ct = _table(j,i)   # split remainder into 8 sections
>
>             # count concordant pairs
>             # -- multiply pivot with 'lower-right' and summate
>             P += np.sum(ct['lower-right'] * pivot)
>
>             # count disconcordant pairs
>             # -- multiply pivot with 'lower-left' and summate
>
>             Q += np.sum(ct['lower-left'] * pivot)
>
>             # count pairs tied in y
>             # -- multiply pivot with 'right' and summate
>
>             Ty += np.sum(ct['right'] * pivot)
>
>             # count pairs tied in x
>             # -- multiply pivot with 'below' and summate
>
>             Tx += np.sum(ct['below'] * pivot)
>
>     n = np.sum(table)
>     tau_a = 2*float(P-Q)/(n*(n-1))
>     tau_b = float(P-Q)/(sqrt((P+Q+Tx)*(P+Q+Ty)))
>     m = C if C < D else D
>     tau_c = (P-Q)*(2*m/float((m-1)*n**2))
>     return tau_a, tau_b, tau_c
>
>
>
> If Fortran 95 is difficult to build and maintain, we can do this in
> Cython as well. There is no need for function calls with array arguments
> here, so Fortan has no advantage over Cython in this case. Something
> like this:
>
>
> import numpy as np
> fom math import sqrt
> cimport numpy as np
> ctypedef np.int_t int_t
> cimport cython
>
> @cython.boundscheck(False)
> def kendalltau_fromct(np.ndarray[int_t, ndim=2] table):
>     '''return tau-a, tau-b and tau-c from contingency table data
>
>     example for contingency table:
>
>     count = np.array([[10,  5,  2],
>                        [ 9, 12, 16]])
>
>     '''
>
>     cdef int C, D, P, Q, Tx, Ty
>     cdef int i, j, ii, jj, m, n
>
>     D = <int> table.shape[0]
>     C = <int> table.shape[1]
>     P = 0
>     Q = 0
>     Tx = 0
>     Ty = 0
>     n = 0
>     for j in range(D):
>         for i in range(C):
>
>             pivot = table[j,i] # use this as pivot
>             n = n + pivot
>
>             # count concordant pairs
>             # -- multiply pivot with 'lower-right' and summate
>             for jj in range(j+1,D):
>                 for ii in range(i+1,C):
>                     P = P + table[jj,ii] * pivot
>
>
>             # count disconcordant pairs
>             # -- multiply pivot with 'lower-left' and summate
>             for jj in range(j+1,D):
>                 for ii in range(i):
>                     Q = Q + table[jj,ii] * pivot
>
>             # count pairs tied in y
>             # -- multiply pivot with 'right' and summate
>             for ii in range(i+1,C):
>                 Ty = Ty + table[j,ii] * pivot
>
>             # count pairs tied in x
>             # -- multiply pivot with 'below' and summate
>             for jj in range(j+1,D):
>                 Tx = Tx + table[jj,i] * pivot
>
>     tau_a = 2*float(P-Q)/(n*(n-1))
>     tau_b = float(P-Q)/(sqrt(float((P+Q+Tx)*(P+Q+Ty))))
>     m = C if C < D else D
>     tau_c = (P-Q)*(2*m/float((m-1)*n**2))
>     return tau_a, tau_b, tau_c
>
>
> Now I am done Kendall's tau for sure. Pick whichever you like.
>
>
>
> Sturla Molden
>

Thanks, this was very instructive, especially seeing a python, a
cython and a fortran version next to each other.
Your explanation with the 8 regions for a pivot point in the table is
very helpful. I looked at the SAS documentation and they define
Kendall tau and similar in an equivalent way, but double counting
pairs.

To be a bit more precise in the comment to Tx and Ty, we should add
that the count excludes simultaneous ties in both variables, since
this started the entire discussion
# Ty: count pairs tied in x and not in y
# Ty: count pairs tied in y and not in x

I only checked your python version, and all numbers agree with mine
and so can be considered as verified with spss and R (for tau-b).

The only part that is left to do, is to find the pvalues for tau-a and
tau-c and check the discrepancy for tau-b to R.
scipy.stats.kendalltau doesn't take ties into account when calculating
the variance of tau-b.

however, quote from the SAS documentation:
http://support.sas.com/documentation/cdl/en/statug/59654/HTML/default/statug_freq_a0000000630.htm

"Note that the ratio of  `est`  to `sqrt(variance(est))`  is the same
for the following measures: gamma, Kendall’s tau-b, Stuart’s tau-c,
Somers’ D(R|C) , and Somers’ D(C|R) . Therefore, the tests for these
measures are identical. For example, the p-values for the test of
H0:gamma=0  equal the p-values for the test of H0:tau-b=0 "

This would imply that we need to calculate the p-value only once. SAS
doesn't have information on tau-a. But there calculation of the
variance of tau-c could be directly included in the loop that
calculates tau-c.

The SAS documentation is pretty good. I should have found it earlier,
and it would have saved me a lot of time googling. The reference for
the calculation for kendall tau-b for the original data series is at
http://support.sas.com/documentation/cdl/en/procstat/59629/HTML/default/procstat_corr_sect015.htm
(here they count pairs only once)

I was not successful in finding the variance, for kendall's tau-a. I
think without p-value, we can also skip tau-a since it doesn't seem so
popular and include gamma which uses the same calculations as tau.

I will open an enhancement ticket for your contingency table version
and the extension of the current stats.kendalltau. I'm in favor of the
cython version with the python version as reference, but I still have
to figure out how to include cython code in scipy.

This was a long thread, but I'm glad that the version proliferation
and ambiguities for Kendall's tau have been cleared up.

Thanks,
Josef