[scikit-learn] Decision tree results sometimes different with scaled data

Alexandre Gramfort alexandre.gramfort at inria.fr
Thu Oct 24 08:09:01 EDT 2019


Another reason is that we take as threshold the midpoint between sample
values, which is not invariant to arbitrary scaling of the features.
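
For instance, a minimal sketch of one way this shows up with standardization
in floating point (the two sample values and the mean/std are made up for
illustration):

import numpy as np

# two neighbouring raw feature values and an arbitrary standardization
a, b = 4.9, 5.0
mean, std = 5.8433, 0.8253

mid_raw = (a + b) / 2                                   # threshold picked on raw data
mid_scaled = ((a - mean) / std + (b - mean) / std) / 2  # threshold picked on scaled data

# mathematically equal, but the two rounding paths can differ by a few ulps,
# so a test sample can land on different sides of the two thresholds
print(mid_scaled == (mid_raw - mean) / std)  # may print False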

Alex



On Tue, Oct 22, 2019 at 11:56 AM Guillaume Lemaître <g.lemaitre58 at gmail.com>
wrote:

> Even with the same random state, it can happen that several features will
> lead to the best split and the split is then chosen randomly (even with the
> seed fixed - this is reported as an issue I think). Therefore, the rest of
> the tree could be different, leading to different predictions.
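>
> A toy illustration of such a tie (made-up data; the second feature is just a
> rescaled copy of the first, so both give the same impurity improvement and
> the splitter has to break the tie):
>
> import numpy as np
> from sklearn.tree import DecisionTreeClassifier
>
> X = np.array([[0., 0.], [1., 10.], [2., 20.], [3., 30.]])
> y = np.array([0, 0, 1, 1])
>
> clf = DecisionTreeClassifier(random_state=0).fit(X, y)
> # which of the two tied features ends up at the root depends on the
> # tie-breaking inside the splitter
> print(clf.tree_.feature[0])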
>
> Another possibility is that we compute the difference between the current
> threshold and the next to be tried and only check the entropy if it is
> larger than a specific value (I would need to check the source code). After
> scaling, it could happen that 2 feature values become too close to be
> considered as a potential split, which will make a difference between the
> scaled and unscaled features. But this difference should be really small.
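>
> A rough sketch of that effect (made-up values; the exact minimal spacing the
> splitter requires between consecutive values would need to be checked in the
> Cython source):
>
> import numpy as np
> from sklearn.preprocessing import StandardScaler
>
> # two raw values that are close but distinct, next to much larger ones
> x = np.array([[0.0], [1e-6], [100.0], [200.0]])
> print(np.diff(np.sort(x.ravel()))[0])         # raw gap: 1e-06
>
> x_scaled = StandardScaler().fit_transform(x)
> print(np.diff(np.sort(x_scaled.ravel()))[0])  # scaled gap: ~1.2e-08, possibly
>                                               # too small to be kept as a
>                                               # candidate threshold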
>
> This is what I can think of off the top of my head.
>
> Sent from my phone - sorry to be brief and for potential misspellings.
> *From:* geoffrey.bolmier at gmail.com
> *Sent:* 22 October 2019 11:34
> *To:* scikit-learn at python.org
> *Reply to:* scikit-learn at python.org
> *Subject:* [scikit-learn] Decision tree results sometimes different with
> scaled data
>
> Hi all,
>
> First, let me thank you for the great job you guys are doing developing
> and maintaining such a popular library!
>
> As we all know, decision trees are not impacted by scaling of the data
> because splits don't take into account distances between two values within a
> feature.
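>
> For example, standardizing doesn't change the ordering of the values within
> a feature, so the same candidate splits remain available (toy values for
> illustration):
>
> import numpy as np
>
> x = np.array([3.1, 0.2, 5.7, 1.4])
> x_scaled = (x - x.mean()) / x.std()
>
> # same ranking before and after scaling
> print(np.argsort(x))         # [1 3 0 2]
> print(np.argsort(x_scaled))  # [1 3 0 2]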
>
> However, I experienced a strange behavior using the sklearn decision tree
> algorithm. Sometimes the model's predictions differ depending on whether the
> input data has been scaled or not.
>
> To illustrate my point, I ran an experiment on the iris dataset consisting of:
>
>    - perform a train/test split
>    - fit a decision tree on the training set and predict the test set
>    - fit and predict again with standardized inputs (removing the mean
>    and scaling to unit variance)
>    - compare both model predictions
>
> The experiment was run 10,000 times with different random seeds (cf.
> traceback and code to reproduce it at the end).
> Results showed that a bit more than 10% of the time we find at least one
> different prediction. Fortunately, when that's the case only a few
> predictions differ, 1 or 2 most of the time. I checked the inputs causing
> different predictions and they are not the same from one run to another.
>
> I'm worried that the rate of different predictions could be larger for other
> datasets...
> Do you have an idea where it comes from? Maybe it's due to floating point
> errors, or am I doing something wrong?
>
> Cheers,
> Geoffrey
>
>
> ------------------------------------------------------------
> Traceback:
> ------------------------------------------------------------
> Error rate: 12.22%
>
> Seed: 241862
> All pred equal: False
> Not scale data confusion matrix:
> [[16  0  0]
> [ 0 17  0]
> [ 0  4 13]]
> Scale data confusion matrix:
> [[16  0  0]
> [ 0 15  2]
> [ 0  4 13]]
> ------------------------------------------------------------
> Code:
> ------------------------------------------------------------
> import numpy as np
>
> from sklearn.datasets import load_iris
> from sklearn.metrics import confusion_matrix
> from sklearn.model_selection import train_test_split
> from sklearn.preprocessing import StandardScaler
> from sklearn.tree import DecisionTreeClassifier
>
>
> X, y = load_iris(return_X_y=True)
>
> def run_experiment(X, y, seed):
>     X_train, X_test, y_train, y_test = train_test_split(
>             X,
>             y,
>             stratify=y,
>             test_size=0.33,
>             random_state=seed
>         )
>
>     scaler = StandardScaler()
>
>     X_train_scaled = scaler.fit_transform(X_train)
>     X_test_scaled = scaler.transform(X_test)
>
>     clf = DecisionTreeClassifier(random_state=seed)
>     clf_scaled = DecisionTreeClassifier(random_state=seed)
>
>     clf.fit(X_train, y_train)
>     clf_scaled.fit(X_train_scaled, y_train)
>
>     pred = clf.predict(X_test)
>     pred_scaled = clf_scaled.predict(X_test_scaled)
>
>     err = 0 if all(pred == pred_scaled) else 1
>
>     return err, y_test, pred, pred_scaled
>
>
> n_err, n_run, seed_err = 0, 10000, None
>
> for _ in range(n_run):
>     seed = np.random.randint(10000000)
>     err, _, _, _ = run_experiment(X, y, seed)
>     n_err += err
>
>     # keep aside last seed causing an error
>     seed_err = seed if err == 1 else seed_err
>
>
> print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')
>
> _, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)
>
> print(f'Seed: {seed_err}')
> print(f'All pred equal: {all(pred == pred_scaled)}')
> print(f'Not scale data confusion matrix:\n{confusion_matrix(y_test, pred)}')
> print(f'Scale data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')