[scikit-learn] Decision tree results sometimes different with scaled data

Geoffrey Bolmier geoffrey.bolmier at gmail.com
Tue Oct 22 05:32:43 EDT 2019


Hi all,

First, let me thank you for the great job you guys are doing developing and maintaining such a popular library!
As we all know, decision trees are not affected by feature scaling, because splits depend only on the ordering of the values within a feature, not on the distances between them.
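
For instance, here is a minimal sketch of what I mean (my own toy illustration, not part of the experiment below): a strictly increasing affine transform a*x + b with a > 0 preserves the ordering of the feature values, so in exact arithmetic the candidate splits are the same before and after scaling.

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)   # a single toy feature
a, b = 2.5, -1.0    # arbitrary positive scale and offset
# same ordering of the samples, hence the same candidate split points
assert np.array_equal(np.argsort(x), np.argsort(a * x + b))
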
However, I observed a strange behavior with sklearn's decision tree implementation: sometimes the model's predictions differ depending on whether the input data has been scaled or not.
To illustrate my point, I ran an experiment on the iris dataset consisting of the following steps:

1. perform a train/test split

2. fit a tree on the training set and predict the test set

3. fit and predict again with standardized inputs (removing the mean and scaling to unit variance; see the short check right after this list)

4. compare the predictions of both models
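
As a quick aside on step 3, this is what I mean by standardizing (a small check I wrote on the side, independent of the experiment code at the end):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
# StandardScaler subtracts the per-feature mean and divides by the
# per-feature (population) standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(StandardScaler().fit_transform(X), X_manual)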

The experiment was run 10,000 times with different random seeds (see the traceback and the code to reproduce it at the end).
The results show that, a bit more than 10% of the time, at least one prediction differs between the two models. Fortunately, when this happens, only a few predictions differ, 1 or 2 most of the time. I also checked the inputs causing the differing predictions, and they are not the same from one run to another.

I'm worried that the rate of differing predictions could be larger on other datasets...
Do you have an idea where this comes from? Could it be due to floating-point errors, or am I doing something wrong?
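
In case it helps, here is a hedged diagnostic I was thinking of running (the seed and variable names are arbitrary, and this is only a guess at where to look, not a confirmed explanation): fit one tree on raw data and one on standardized data, then map the scaled tree's thresholds back to the original units and see how far they are from the raw tree's thresholds.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
seed = 0  # arbitrary seed, for illustration only

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

clf = DecisionTreeClassifier(random_state=seed).fit(X, y)
clf_scaled = DecisionTreeClassifier(random_state=seed).fit(X_scaled, y)

tree, tree_s = clf.tree_, clf_scaled.tree_
split = tree.children_left != -1      # internal (non-leaf) nodes
split_s = tree_s.children_left != -1

same_structure = (
    tree.node_count == tree_s.node_count
    and np.array_equal(tree.feature[split], tree_s.feature[split_s])
)
if same_structure:
    # map scaled thresholds back to original units: t = t_scaled * scale + mean
    feat = tree_s.feature[split_s]
    t_back = tree_s.threshold[split_s] * scaler.scale_[feat] + scaler.mean_[feat]
    print('max threshold gap:', np.max(np.abs(t_back - tree.threshold[split])))
else:
    print('the two trees do not even share the same structure')
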

Cheers,
Geoffrey

------------------------------------------------------------
Traceback:
------------------------------------------------------------
Error rate: 12.22%

Seed: 241862
All pred equal: False
Unscaled data confusion matrix:
[[16  0  0]
 [ 0 17  0]
 [ 0  4 13]]
Scaled data confusion matrix:
[[16  0  0]
 [ 0 15  2]
 [ 0  4 13]]
------------------------------------------------------------
Code:
------------------------------------------------------------
import numpy as np

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def run_experiment(X, y, seed):
    # Split, then fit and predict twice (raw vs. standardized inputs)
    # and flag whether any prediction differs between the two models.
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        stratify=y,
        test_size=0.33,
        random_state=seed
    )

    scaler = StandardScaler()

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    clf = DecisionTreeClassifier(random_state=seed)
    clf_scaled = DecisionTreeClassifier(random_state=seed)

    clf.fit(X_train, y_train)
    clf_scaled.fit(X_train_scaled, y_train)

    pred = clf.predict(X_test)
    pred_scaled = clf_scaled.predict(X_test_scaled)

    err = 0 if all(pred == pred_scaled) else 1

    return err, y_test, pred, pred_scaled


n_err, n_run, seed_err = 0, 10000, None
for _ in range(n_run):
    seed = np.random.randint(10000000)
    err, _, _, _ = run_experiment(X, y, seed)
    n_err += err

    # keep aside the last seed causing an error
    seed_err = seed if err == 1 else seed_err

print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')
_, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)
print(f'Seed: {seed_err}')
print(f'All pred equal: {all(pred == pred_scaled)}')
print(f'Unscaled data confusion matrix:\n{confusion_matrix(y_test, pred)}')
print(f'Scaled data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')