[scikit-learn] Smoke and Metamorphic Testing of scikit-learn

Steffen Herbold herbold at cs.uni-goettingen.de
Wed Aug 22 07:12:51 EDT 2018


Dear developers,

I am writing to you because I applied an approach for the automated
testing of classification algorithms to scikit-learn and would like to
share the results with you.

The approach is a combination of smoke testing and metamorphic testing.
The smoke tests try to find problems by executing the training and
prediction functions of classifiers with different data; they should
ensure the basic functioning of classifiers. I defined 20 different
data sets, some very simple (uniformly distributed features in [0,1]),
some with extreme distributions, e.g., data close to machine precision.
The metamorphic tests determine whether classification results change
as expected when the training data is modified, e.g., by reordering
features, flipping class labels, or reordering instances.
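To illustrate the idea, here is a minimal sketch of what such tests can
look like. The data, classifier, and assertions are simplified
stand-ins that I chose for this e-mail, not the generated tests
themselves:

import unittest

import numpy as np
from sklearn.linear_model import LogisticRegression

class TestSketch(unittest.TestCase):

    def test_smoke_uniform_features(self):
        # Smoke test: fit and predict must run without crashing
        # on uniformly distributed features in [0, 1].
        rng = np.random.RandomState(42)
        X = rng.uniform(size=(100, 10))
        y = rng.randint(0, 2, size=100)
        clf = LogisticRegression()
        clf.fit(X, y)
        clf.predict(X)

    def test_metamorphic_reorder_instances(self):
        # Metamorphic test: shuffling the training instances
        # should leave the predictions unchanged.
        rng = np.random.RandomState(42)
        X = rng.uniform(size=(100, 10))
        y = rng.randint(0, 2, size=100)
        X_test = rng.uniform(size=(20, 10))
        pred = LogisticRegression().fit(X, y).predict(X_test)
        perm = rng.permutation(len(X))
        pred_shuffled = LogisticRegression().fit(
            X[perm], y[perm]).predict(X_test)
        np.testing.assert_array_equal(pred, pred_shuffled)

if __name__ == '__main__':
    unittest.main()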

I generated 70 different Python unittest tests for eleven different 
scikit-learn classifiers. In summary, I found the following potential 
problems:
- Two errors due to possible infinite loops in LogisticRegression for
data that approaches MAXDOUBLE.
- The classification of LogisticRegression, MLPClassifier,
QuadraticDiscriminantAnalysis, and SVM with a polynomial kernel changed
when one was added to each feature value.
- The classification of DecisionTreeClassifier, LogisticRegression,
MLPClassifier, QuadraticDiscriminantAnalysis, RandomForestClassifier,
and SVM with a linear and a polynomial kernel was not inverted when all
binary class labels were flipped (see the sketch after this list).
- The classification of LogisticRegression, MLPClassifier,
QuadraticDiscriminantAnalysis, and RandomForestClassifier sometimes
changed when the features were reordered.
- The classification of KNeighborsClassifier, MLPClassifier,
QuadraticDiscriminantAnalysis, RandomForestClassifier, and SVM with a
linear kernel sometimes changed when the instances were reordered.
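To make the flipped-labels relation from the third point concrete, a
simplified stand-alone sketch of that check looks as follows (again
with made-up data; the generated tests differ in the details):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 10))
y = rng.randint(0, 2, size=100)
X_test = rng.uniform(size=(20, 10))

pred = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X_test)
# Retrain on inverted binary labels; the metamorphic relation
# expects exactly inverted predictions.
pred_flipped = DecisionTreeClassifier(
    random_state=0).fit(X, 1 - y).predict(X_test)
assert np.array_equal(pred_flipped, 1 - pred)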

You can find details of our results online [1]. The provided resources
include the current draft of the paper that describes the tests as
well as the detailed results. Moreover, we provide an executable test
suite with all tests we executed, as well as an export of our test
results as an XML file that contains all details of the test execution,
including stack traces in case of exceptions. The preprint and online
materials also contain the results for two other machine learning
libraries, Weka and Spark MLlib. Additionally, you can find the atoml
tool used to generate the tests on GitHub [2].

I hope that these tests may help with the future development of 
scikit-learn. You could help me a lot by answering the following questions:
- Do you consider the tests helpful?
- Are you considering any source code or documentation changes due to
our findings?
- Would you be interested in a pull request or any other type of 
integration of (a subset of) the tests into your project?
- Would you be interested in more such tests, e.g., tests that take
hyperparameters into account, other algorithm types such as
clustering, or more complex algorithm-specific metamorphic tests?

I am looking forward to your feedback.

Best regards,
Steffen Herbold

[1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/
[2] https://github.com/sherbold/atoml

-- 
Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. herbold at cs.uni-goettingen.de
tel. +49 551 39-172037


