[SciPy-Dev] Request to add functionality to scipy.stats

Sambit Panda spanda3 at jhu.edu
Thu Jun 6 12:09:11 EDT 2019


Request for project components' inclusion in scipy.stats

- Project name: "mgcpy"
- Authors: Satish Palaniappan (https://github.com/tpsatish95), Sambit Panda (https://github.com/sampan501), Junhao Xiong (https://github.com/junhaobearxiong), Ananya Swaminathan (https://github.com/ananyas713), Sandhya Ramachandran (https://github.com/sundaysundya), Richard Guo (https://github.com/rguo123)
- Current repository: https://github.com/neurodata/mgcpy

"mgcpy" is a Python package containing tools for independence testing and k-sample testing. The "scipy.stats" module already contains a host of independence and other hypothesis tests, but they are limited by assumptions of normality, linearity, unidimensionality, etc. While these assumptions may be appropriate in many circumstances, it is increasingly important to analyze nonlinear and high-dimensional trends, which is where the implementations in "mgcpy" could be very useful. The independence tests included can operate on multidimensional and nonlinear data. In addition, the functionality has been extended to k-sample testing (on the same kinds of data). The included tests can be used not only for classification but also for regression.

Below is a list of some of the tests integrated into "mgcpy", with citations for the relevant papers.
- RV: P. Robert and Y. Escoufier, "A unifying tool for linear multivariate statistical methods: the rv-coefficient," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 25, no. 3, pp. 257–265, 1976.
- CCA: D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004.
- HHG: R. Heller, Y. Heller, and M. Gorfine, "A consistent multivariate test of association based on ranks of distances," Biometrika, vol. 100, no. 2, pp. 503–510, 2012.
- MDMR: N. J. Schork and M. A. Zapala, "Statistical properties of multivariate distance matrix regression for high-dimensional data analysis," Frontiers in Genetics, vol. 3, p. 190, 2012.
- Biased Dcorr, Unbiased Dcorr: G. J. Székely, M. L. Rizzo, N. K. Bakirov et al., "Measuring and testing dependence by correlation of distances," The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
- Mantel: N. Mantel, "The detection of disease clustering and a generalized regression approach," Cancer research, vol. 27, no. 2 Part 1, pp. 209–220, 1967.
- MANOVA: Warne, R. T. (2014). "A primer on multivariate analysis of variance (MANOVA) for behavioral scientists". Practical Assessment, Research & Evaluation. 19 (17): 1–10.
- k-sample tests: Martínez-Camblor, P., & de Uña-Álvarez, J. (2009). Non-parametric k-sample tests: Density functions vs distribution functions. Computational Statistics & Data Analysis, 53(9), 3344-3357.
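
To make the kind of test in this list concrete, here is a minimal sketch of the biased distance correlation statistic (Székely et al., 2007) with a permutation p-value, written against NumPy and SciPy only. It is an illustration, not the "mgcpy" implementation or API: the function names dcorr and dcorr_permutation_test and the parameters n_perm and seed are assumptions made up for this example.

```python
import numpy as np
from scipy.spatial.distance import cdist


def _centered_distances(z):
    """Pairwise Euclidean distance matrix of z, double-centered."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a 1-D input as n samples of dimension 1
    d = cdist(z, z)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcorr(x, y):
    """Biased sample distance correlation between x (n, p) and y (n, q)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()        # squared distance covariance
    dvar2_x = (a * a).mean()      # squared distance variance of x
    dvar2_y = (b * b).mean()      # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0
    # max() guards against tiny negative values from floating-point error
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dcorr_permutation_test(x, y, n_perm=200, seed=0):
    """Hypothetical helper: permutation p-value for H0 'x and y are independent'."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = dcorr(x, y)
    # Shuffling the rows of y breaks any pairing with x, simulating the null.
    null = np.array([dcorr(x, y[rng.permutation(len(y))]) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p_value)


# Nonlinear (quadratic) dependence of the kind a linear correlation can miss.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(100, 1))
y = x**2 + 0.1 * rng.normal(size=(100, 1))
stat, p = dcorr_permutation_test(x, y)
print(f"dcorr = {stat:.3f}, permutation p-value = {p:.3f}")
```

The same functions accept multidimensional x and y, since the statistic depends on the data only through pairwise distances.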

Tests not included, but useful related reading:
- Equivalency of Dcorr, HSIC, Energy, and MMD: C. Shen and J. T. Vogelstein, "The exact equivalence of distance and kernel methods for hypothesis testing," arXiv preprint arXiv:1806.05514, 2018.
- Formulating k-sample tests as independence tests: C. Shen and J. T. Vogelstein, "The exact equivalence of distance and kernel methods for hypothesis testing," arXiv preprint arXiv:1806.05514, 2018.
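
As a rough sketch of how a k-sample test can be posed as an independence test: pool the samples, build a one-hot matrix of group labels, and test whether the pooled data and the labels are independent. The example below uses the biased squared distance covariance as the statistic with a permutation p-value. The function name k_sample_test, the one-hot label encoding, and the parameters n_perm and seed are assumptions for illustration only, and this is not the "mgcpy" API.

```python
import numpy as np
from scipy.spatial.distance import cdist


def _centered_distances(z):
    """Pairwise Euclidean distance matrix of z, double-centered."""
    d = cdist(z, z)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def k_sample_test(samples, n_perm=200, seed=0):
    """Hypothetical helper: k-sample test via independence of data and labels.

    samples is a list of 2-D arrays, each (n_i, p), sharing the same p.
    Returns (squared distance covariance, permutation p-value).
    """
    rng = np.random.default_rng(seed)
    data = np.vstack([np.asarray(s, dtype=float) for s in samples])
    # Group index of every pooled row, then one-hot encode it.
    labels = np.repeat(np.arange(len(samples)), [len(s) for s in samples])
    one_hot = np.eye(len(samples))[labels]

    a = _centered_distances(data)
    b = _centered_distances(one_hot)
    observed = (a * b).mean()

    # Permuting labels equals permuting rows and columns of b together.
    # The normalizing terms of distance correlation do not change under
    # permutation, so the unnormalized covariance gives the same p-value.
    n = len(data)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(n)
        if (a * b[np.ix_(idx, idx)]).mean() >= observed:
            count += 1
    return float(observed), (1 + count) / (1 + n_perm)


# Three 2-D samples; the third has a shifted mean.
rng = np.random.default_rng(2)
groups = [
    rng.normal(0.0, 1.0, size=(40, 2)),
    rng.normal(0.0, 1.0, size=(40, 2)),
    rng.normal(2.0, 1.0, size=(40, 2)),
]
stat, p = k_sample_test(groups)
print(f"statistic = {stat:.4f}, permutation p-value = {p:.3f}")
```

A small p-value here suggests that at least one sample comes from a different distribution than the others.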

