[scikit-learn] Smoke and Metamorphic Testing of scikit-learn

Andreas Mueller t3kcit at gmail.com
Wed Aug 22 11:49:02 EDT 2018


Hi Steffen.

Thanks for sharing your analysis. We really need more work in this 
direction.
I assume you fixed the random states everywhere?
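
Something like the helper below is what I mean by "everywhere" (it's 
just a sketch of mine, not a scikit-learn function):

    def fix_random_states(estimator, seed=0):
        # Set every random_state parameter the estimator (or a nested
        # pipeline / meta-estimator) exposes, so that repeated fits
        # are reproducible before running the metamorphic tests.
        params = {name: seed for name in estimator.get_params(deep=True)
                  if name == "random_state"
                  or name.endswith("__random_state")}
        return estimator.set_params(**params)

    # e.g. fix_random_states(RandomForestClassifier()) before each fit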

I consider these tests helpful, but not all of your expectations are 
warranted; it depends on the model.

If you add one to each feature, there is no expectation that the results 
will be the same, except for the tree-based models.
For tree-based models with fixed random states, however, it's expected 
that reordering features will change the result.
For non-convex optimization it's expected that the results are not 
symmetric (i.e. the MLPClassifier will not flip the decision function 
when all class labels are flipped, because the optimization is 
initialized in an asymmetric way), and reordering features will also 
change the result. If mini-batches are used (the default), the results 
will also change when instances are reordered.
I assume you didn't test SGDClassifier or any of its derivatives, 
because it doesn't show up here. Did you test LinearDiscriminantAnalysis?
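
To make the tree case concrete, here is a rough sketch of the kind of 
invariance I do expect (toy data and names are mine): shifting every 
feature by a constant should not change the predictions of an 
axis-aligned tree with a fixed random state.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    clf = DecisionTreeClassifier(random_state=0)
    pred_original = clf.fit(X, y).predict(X)
    pred_shifted = clf.fit(X + 1, y).predict(X + 1)

    # Shifting all features by a constant only shifts the split
    # thresholds, so the partition of the samples should be identical.
    assert np.array_equal(pred_original, pred_shifted)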

For the invariance tests it would be interesting to know whether the 
failures are due to tie-breaking or to numerical issues.
There are some numerical issues that are very hard to control, and I'm 
pretty sure we have asymmetric tie-breaking
(multiclass libsvm's rule is "always predict the first class", 
https://github.com/scikit-learn/scikit-learn/issues/8276 ).

I would look at QuadraticDiscriminantAnalysis a bit more closely as a 
consequence of your tests.
Maybe check whether the SVM, RF and KNN issues are due to tie-breaking.
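
A rough way to check this in the binary case (the dataset and the 
permutation below are just placeholders): refit on the modified data 
and look at how close the samples whose prediction changed were to the 
decision boundary of the original fit.

    import numpy as np
    from sklearn.svm import SVC

    def margins_of_changed_predictions(clf, X, y, X_mod, y_mod):
        # Fit on the original and on the modified (e.g. reordered)
        # training data and return the original decision-function
        # margins of the samples whose prediction changed.
        pred_orig = clf.fit(X, y).predict(X)
        margin = clf.decision_function(X)  # signed distance, binary case
        pred_mod = clf.fit(X_mod, y_mod).predict(X)
        changed = pred_orig != pred_mod
        return np.abs(margin[changed])

    # Margins near zero point to tie-breaking rather than a real change;
    # for RandomForest or KNeighbors one would look at predict_proba
    # values near 0.5 instead.
    # perm = np.random.RandomState(0).permutation(len(y))
    # print(margins_of_changed_predictions(SVC(kernel="linear"), X, y,
    #                                      X[perm], y[perm]))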

We could try to document all the cases where the result will not 
fulfill these invariances, but I think that might be too much.
At some point we need the users to understand what's going on. If you 
look at the random forest algorithm and you fix the random state, it's 
obvious that feature order matters.
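
A small made-up example of why: with max_features below n_features, 
the fixed RNG draws feature indices, so permuting the columns changes 
which features each split gets to see.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 10)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    perm = rng.permutation(X.shape[1])      # reorder the columns
    clf = RandomForestClassifier(random_state=0)

    pred = clf.fit(X, y).predict(X)
    pred_perm = clf.fit(X[:, perm], y).predict(X[:, perm])

    # Same seed, same data up to column order -- the predictions can
    # still differ because different feature subsets are sampled.
    print("differing predictions:", (pred != pred_perm).sum())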

A big question here is how big the differences are. Some algorithms are 
randomized (I think the coordinate descent in
some of the linear models uses random orders), but the results are 
expected to be near-identical, independent of the ordering.
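
For those randomized-but-convex cases a tolerance-based check would be 
the right test; a sketch with Lasso and made-up data (the alpha and 
tolerance are arbitrary):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.randn(200, 10)
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

    perm = rng.permutation(X.shape[1])
    est = Lasso(alpha=0.1, selection="random", random_state=0,
                max_iter=10000)

    coef = est.fit(X, y).coef_
    coef_perm = est.fit(X[:, perm], y).coef_

    # The coordinate updates happen in a different order, but the
    # solutions should agree up to the optimizer's tolerance.
    print("max coefficient difference:",
          np.max(np.abs(coef[perm] - coef_perm)))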

Cheers,

Andy


On 8/22/18 7:12 AM, Steffen Herbold wrote:
> Dear developers,
>
> I am writing you because I applied an approach for the automated 
> testing of classification algorithms to scikit-learn and would like to 
> forward the results to you.
>
> The approach is a combination of smoke testing and metamorphic 
> testing. The smoke tests try to find problems by executing the 
> training and prediction functions of classifiers with different data. 
> These smoke tests should ensure the basic functioning of classifiers. 
> I defined 20 different data sets, some very simple (uniform features 
> in [0,1]), some with extreme distributions, e.g., data close to 
> machine precision. The metamorphic tests determine whether the 
> classification results change as expected when the training data is 
> modified, e.g., by reordering features, flipping class labels, or 
> reordering instances.
>
> I generated 70 different Python unittest tests for eleven different 
> scikit-learn classifiers. In summary, I found the following potential 
> problems:
> - Two errors due to possibly infinite loops for the 
> LogisticRegressionClassifier for data that approaches MAXDOUBLE.
> - The classification of LogisticRegression, MLPClassifier, 
> QuadraticDiscriminantAnalysis, and SVM with a polynomial kernel 
> changed if one is added to each feature value.
> - The classification of DecisionTreeClassifier, LogisticRegression, 
> MLPClassifier, QuadraticDiscriminantAnalysis, RandomForestClassifier, 
> and SVM with a linear and a polynomial kernel were not inverted when 
> all binary class labels are flipped.
> - The classification of LogisticRegression, MLPClassifier, 
> QuadraticDiscriminantAnalysis, and RandomForestClassifier sometimes 
> changed when the features are reordered.
> - The classification of KNeighborsClassifier, MLPClassifier, 
> QuadraticDiscriminantAnalysis, RandomForestClassifier, and SVM with a 
> linear kernel sometimes changed when the instances are reordered.
>
> You can find details of our results online [1]. The provided resources 
> include the current draft of the paper that describes the tests, as 
> well as detailed results. Moreover, we provide an executable test 
> suite with all the tests we executed, as well as an export of our 
> test results as an XML file that contains all details of the test 
> execution, including stack traces in case of exceptions. The preprint 
> and online materials also contain the results for two other machine 
> learning libraries, i.e., Weka and Spark MLlib. Additionally, you can 
> find the atoml tool used to generate the tests on GitHub [2].
>
> I hope that these tests may help with the future development of 
> scikit-learn. You could help me a lot by answering the following 
> questions:
> - Do you consider the tests helpful?
> - Are you considering any source code or documentation changes due to 
> our findings?
> - Would you be interested in a pull request or any other type of 
> integration of (a subset of) the tests into your project?
> - Would you be interested in more such tests, e.g., for the 
> consideration of hyperparameters, other algorithm types like 
> clustering, or more complex algorithm-specific metamorphic tests?
>
> I am looking forward to your feedback.
>
> Best regards,
> Steffen Herbold
>
> [1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/
> [2] https://github.com/sherbold/atoml
>

