[scikit-learn] scikit-learn Digest, Vol 43, Issue 38

Adrin adrin.jalali at gmail.com
Fri Oct 25 03:39:09 EDT 2019


Hi,

It's in the making: https://github.com/scikit-learn/scikit-learn/pull/14696
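
In the meantime, the classic GradientBoostingClassifier and
GradientBoostingRegressor already accept sample weights via the
sample_weight argument of fit. A minimal sketch (the weights below are
made up purely for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Illustrative weighting: give class-1 samples twice the influence.
weights = np.where(y == 1, 2.0, 1.0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y, sample_weight=weights)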



On Fri, Oct 25, 2019 at 4:23 AM WONG Wing Mei <Wong.WingMei at uobgroup.com>
wrote:

> Can I ask whether we can use sample weights in gradient boosting? And how
> do we do it?
>
> -----Original Message-----
> From: scikit-learn [mailto:scikit-learn-bounces+wong.wingmei=
> uobgroup.com at python.org] On Behalf Of scikit-learn-request at python.org
> Sent: Friday, October 25, 2019 12:00 AM
> To: scikit-learn at python.org
> Subject: scikit-learn Digest, Vol 43, Issue 38
>
> Send scikit-learn mailing list submissions to
>         scikit-learn at python.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-request at python.org
>
> You can reach the person managing the list at
>         scikit-learn-owner at python.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
>
>
> Today's Topics:
>
>    1. Re: Decision tree results sometimes different with scaled
>       data (Alexandre Gramfort)
>    2. Reminder: Monday October 28th meeting (Adrin)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 24 Oct 2019 14:09:01 +0200
> From: Alexandre Gramfort <alexandre.gramfort at inria.fr>
> To: Scikit-learn mailing list <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Decision tree results sometimes different
>         with scaled data
> Message-ID:
>         <
> CADeotZrh_bXHAqV6WDNRoUt4ZXW_+eObj6_vmwMA50AnahkxgA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Another reason is that we take as the threshold the midpoint between
> sample values, which is not invariant to arbitrary scaling of the features.
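>
> A small sketch of how that bites in floating point, on made-up data:
> analytically the midpoint of two standardized values equals the
> standardized midpoint, but in float64 the two can disagree in the last
> bits, which is enough to flip a comparison (whether it actually flips
> depends on the data):
>
> import numpy as np
>
> rng = np.random.RandomState(0)
> x = rng.rand(1000)
> mean, std = x.mean(), x.std()
>
> a, b = np.sort(x)[:2]  # two neighbouring sample values
> mid = (a + b) / 2      # split threshold on the raw scale
>
> # The same threshold computed after standardization:
> mid_scaled = ((a - mean) / std + (b - mean) / std) / 2
>
> # Equal analytically, but not necessarily equal in float64:
> print(mid_scaled == (mid - mean) / std)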
>
> Alex
>
>
>
> On Tue, Oct 22, 2019 at 11:56 AM Guillaume Lemaître <
> g.lemaitre58 at gmail.com>
> wrote:
>
> > Even with the same random state, it can happen that several features lead
> > to an equally good best split, and the winner is then chosen arbitrarily
> > (even with the seed fixed - I think this has been reported as an issue).
> > From that point on, the rest of the tree can differ, leading to different
> > predictions.
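> >
> > A quick way to see the tie effect: fit the same data with the feature
> > columns reversed. When several candidate splits are equally good, the
> > winner can change, and so can downstream predictions (whether it does
> > depends on the ties in the data); a sketch:
> >
> > from sklearn.datasets import load_iris
> > from sklearn.tree import DecisionTreeClassifier
> >
> > X, y = load_iris(return_X_y=True)
> >
> > clf_a = DecisionTreeClassifier(random_state=0).fit(X, y)
> > # Same data, columns reversed: ties between features may now be
> > # broken in favour of a different feature.
> > clf_b = DecisionTreeClassifier(random_state=0).fit(X[:, ::-1], y)
> >
> > n_diff = (clf_a.predict(X) != clf_b.predict(X[:, ::-1])).sum()
> > print(n_diff)  # may be non-zero when splits are tied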
> >
> > Another possibility is that we compute the difference between the current
> > threshold and the next one to be tried, and only evaluate the entropy if
> > it is larger than a specific value (I would need to check the source
> > code). After scaling, two feature values could become too close to be
> > considered as a potential split, which would make a difference between
> > the scaled and unscaled features. But this difference should be really
> > small.
> >
> > This is what I can think of off the top of my head.
> >
> > Sent from my phone - sorry for being brief and for any misspellings.
> > *From:* geoffrey.bolmier at gmail.com
> > *Sent:* 22 October 2019 11:34
> > *To:* scikit-learn at python.org
> > *Reply to:* scikit-learn at python.org
> > *Subject:* [scikit-learn] Decision tree results sometimes different with
> > scaled data
> >
> > Hi all,
> >
> > First, let me thank you for the great job you guys are doing developing
> > and maintaining such a popular library!
> >
> > As we all know, decision trees are not impacted by scaling because splits
> > don't take into account distances between two values within a feature.
> >
> > However, I experienced a strange behavior using the sklearn decision tree
> > algorithm. Sometimes the model's predictions differ depending on whether
> > the input data has been scaled or not.
> >
> > To illustrate my point, I ran experiments on the iris dataset, each
> > consisting of:
> >
> >    - perform a train/test split
> >    - fit the training set and predict the test set
> >    - fit and predict again with standardized inputs (removing the mean
> >    and scaling to unit variance)
> >    - compare both model predictions
> >
> > The experiment was run 10,000 times with different random seeds (cf.
> > traceback and code to reproduce it at the end).
> > Results showed that a bit more than 10% of the time at least one
> > prediction differs. Fortunately, when that happens only a few predictions
> > differ, 1 or 2 most of the time. I checked the inputs causing the
> > different predictions and they are not the same from run to run.
> >
> > I'm worried that the rate of differing predictions could be larger for
> > other datasets...
> > Do you have an idea where this comes from, maybe floating point errors,
> > or am I doing something wrong?
> >
> > Cheers,
> > Geoffrey
> >
> >
> > ------------------------------------------------------------
> > Traceback:
> > ------------------------------------------------------------
> > Error rate: 12.22%
> >
> > Seed: 241862
> > All pred equal: False
> > Unscaled data confusion matrix:
> > [[16  0  0]
> >  [ 0 17  0]
> >  [ 0  4 13]]
> > Scaled data confusion matrix:
> > [[16  0  0]
> >  [ 0 15  2]
> >  [ 0  4 13]]
> > ------------------------------------------------------------
> > Code:
> > ------------------------------------------------------------
> > import numpy as np
> >
> > from sklearn.datasets import load_iris
> > from sklearn.metrics import confusion_matrix
> > from sklearn.model_selection import train_test_split
> > from sklearn.preprocessing import StandardScaler
> > from sklearn.tree import DecisionTreeClassifier
> >
> >
> > X, y = load_iris(return_X_y=True)
> >
> > def run_experiment(X, y, seed):
> >     X_train, X_test, y_train, y_test = train_test_split(
> >             X,
> >             y,
> >             stratify=y,
> >             test_size=0.33,
> >             random_state=seed
> >         )
> >
> >     scaler = StandardScaler()
> >
> >     X_train_scaled = scaler.fit_transform(X_train)
> >     X_test_scaled = scaler.transform(X_test)
> >
> >     clf = DecisionTreeClassifier(random_state=seed)
> >     clf_scaled = DecisionTreeClassifier(random_state=seed)
> >
> >     clf.fit(X_train, y_train)
> >     clf_scaled.fit(X_train_scaled, y_train)
> >
> >     pred = clf.predict(X_test)
> >     pred_scaled = clf_scaled.predict(X_test_scaled)
> >
> >     err = 0 if all(pred == pred_scaled) else 1
> >
> >     return err, y_test, pred, pred_scaled
> >
> >
> > n_err, n_run, seed_err = 0, 10000, None
> >
> > for _ in range(n_run):
> >     seed = np.random.randint(10000000)
> >     err, _, _, _ = run_experiment(X, y, seed)
> >     n_err += err
> >
> >     # keep aside last seed causing an error
> >     seed_err = seed if err == 1 else seed_err
> >
> >
> > print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')
> >
> > _, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)
> >
> > print(f'Seed: {seed_err}')
> > print(f'All pred equal: {all(pred == pred_scaled)}')
> > print(f'Unscaled data confusion matrix:\n{confusion_matrix(y_test, pred)}')
> > print(f'Scaled data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')
> >
>
> ------------------------------
>
> Message: 2
> Date: Thu, 24 Oct 2019 17:10:26 +0200
> From: Adrin <adrin.jalali at gmail.com>
> To: Scikit-learn mailing list <scikit-learn at python.org>
> Subject: [scikit-learn] Reminder: Monday October 28th meeting
> Message-ID:
>         <
> CAEOrW48htWpXLwZ2daKSbas5utepG6kc_XgrWWvTDoCVTD7oQw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Scikit-learn people,
>
> This is a reminder that we'll be having our monthly call on Monday.
>
> Please add your thoughts and any important topics you have in mind to
> the project board:
> https://github.com/scikit-learn/scikit-learn/projects/15
>
> We'll be meeting on https://appear.in/amueller
>
> As usual, it'd be nice to have them on the board before the weekend :)
>
> See you on Monday,
> Adrin.
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 38
> ********************************************
>
>

