From adrin.jalali at gmail.com Thu Jan 2 10:40:35 2020 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 2 Jan 2020 16:40:35 +0100 Subject: [scikit-learn] Using a new random number generator in libsvm and liblinear Message-ID: Hi, liblinear and libsvm use the C `rand()` function which returns number up to 32767 on the windows platform. This PR proposes the following fix: *Fixed a convergence issue in ``libsvm`` and ``liblinear`` on Windows platforms* *impacting all related classifiers and regressors. The random number generator* *used to randomly select coordinates in the coordinate descent algorithm was* *C ``rand()``, that is only able to generate numbers up to ``32767`` on windows* *platform. It was replaced with C++11 ``mt19937``, a Mersenne Twister that* *correctly generates 31bits/63bits random numbers on all platforms. In addition,* *the crude "modulo" postprocessor used to get a random number in a bounded* *interval was replaced by the tweaked Lemire method as suggested by `this blog* *post >`* In order to keep the models consistent across platforms, we'd like to use the same (new) rng on all platforms, which means after this change the generated models may be slightly different to what they are now. We'd like to hear any concerns on the matter from the community, here or on the PR, before merging the fix. Best, Adrin. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruchika.work at gmail.com Thu Jan 2 10:45:25 2020 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 2 Jan 2020 10:45:25 -0500 Subject: [scikit-learn] Using a new random number generator in libsvm and liblinear In-Reply-To: References: Message-ID: OK On Thu, Jan 2, 2020, 10:42 AM Adrin wrote: > Hi, > > liblinear and libsvm use the C `rand()` function which returns number up to > 32767 on the windows platform. This PR > proposes the > following fix: > > *Fixed a convergence issue in ``libsvm`` and ``liblinear`` on Windows > platforms* > *impacting all related classifiers and regressors. The random number > generator* > *used to randomly select coordinates in the coordinate descent algorithm > was* > *C ``rand()``, that is only able to generate numbers up to ``32767`` on > windows* > *platform. It was replaced with C++11 ``mt19937``, a Mersenne Twister that* > *correctly generates 31bits/63bits random numbers on all platforms. In > addition,* > *the crude "modulo" postprocessor used to get a random number in a bounded* > *interval was replaced by the tweaked Lemire method as suggested by `this > blog* > *post >`* > > In order to keep the models consistent across platforms, we'd like to use > the same (new) rng > on all platforms, which means after this change the generated models may > be slightly different > to what they are now. We'd like to hear any concerns on the matter from > the community, here > or on the PR, before merging the fix. > > Best, > Adrin. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Thu Jan 2 12:57:38 2020 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 2 Jan 2020 18:57:38 +0100 Subject: [scikit-learn] scikit-learn 0.22.1 is out! Message-ID: This is a minor release that includes many bug fixes and solves a number of packaging issues with Windows wheels in particular. 
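The bounded-draw idea from Lemire's post mentioned in the libsvm/liblinear thread above can be sketched in a few lines of Python. This is only an illustration of the multiply/shift reduction, not the actual patch (which lives in the bundled libsvm/liblinear C++ sources); the bias-removing rejection step from the blog post is omitted, and the generator and seed below are arbitrary.

import random

gen = random.Random(42)  # CPython's Mersenne Twister, standing in for C++ mt19937

def bounded_draw(n):
    # Map one 32-bit draw onto [0, n) with a multiply/shift instead of `% n`.
    x = gen.getrandbits(32)        # uniform on [0, 2**32)
    return (x * n) >> 32           # lands in [0, n); debiasing rejection step omitted

print([bounded_draw(100000) for _ in range(5)])  # values above 32767 occur, unlike Windows rand()
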
Here is the full changelog: https://scikit-learn.org/stable/whats_new/v0.22.html#version-0-22-1 The conda package will follow soon (hopefully). Thank you very much to all who contributed to this release! Cheers and happy new year! -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel From alexandre.gramfort at inria.fr Sat Jan 4 07:49:50 2020 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Sat, 4 Jan 2020 13:49:50 +0100 Subject: [scikit-learn] Using a new random number generator in libsvm and liblinear In-Reply-To: References: Message-ID: I don't foresee any issue with that. Alex From gael.varoquaux at normalesup.org Sat Jan 4 15:22:12 2020 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 4 Jan 2020 15:22:12 -0500 Subject: [scikit-learn] Using a new random number generator in libsvm and liblinear In-Reply-To: References: Message-ID: <20200104202212.gwqww6axlq7shmjt@phare.normalesup.org> Me neither. The only drawback that I see is that we have a codebase that is drifting more and more from upstream. But I think that that ship has sailed. G On Sat, Jan 04, 2020 at 01:49:50PM +0100, Alexandre Gramfort wrote: > I don't foresee any issue with that. > Alex > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Research Director, INRIA Visiting professor, McGill http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From marmochiaskl at gmail.com Mon Jan 6 05:13:40 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Mon, 6 Jan 2020 11:13:40 +0100 Subject: [scikit-learn] Issues for Berlin and Paris Sprints Message-ID: Dear core-devs, First let me wish a Happy New Year to you all! There will be two scikit-learn sprints in January to start this 2020 in a busy way: one in Berlin [1] (Jan 25) and one in Paris [2] (Jan 28-31). I feel like we could benefit of some coordination in selecting the issues for those two events. Reshama Shaikh and I, we are already in touch. I've opened two projects [3][4] to follow-up the issue selection for the sprints. I will check for previous "Sprint" labels in the skl issues and maybe ask for clarification on some of them... please, be patient. The goal is to prepare the two sprints in order to make the review process as efficient as possible: we don't want to waste the reviewer time and we hope to make the PR experience a learning opportunity on both sides. In particular, I would like to ask a favour to all of you: I don't know if this is even always possible, but, IMO, it would be really useful to have a list of two/three reviewers available to check on a specific issue. I am, personally, a bit uncomfortable in pinging core-devs randomly, under the impression of crying wolf lacking for attention... If people in charge are defined in advance this could, I think, smooth the review process. What do you think? Please, let us know if you have any suggestion or recommendation to improve the Sprint organization. Thanks for listening, Best, Chiara [1] https://github.com/WiMLDS/berlin-2020-scikit-sprint [2] https://github.com/scikit-learn/scikit-learn/wiki/Paris-scikit-learn-Sprint-of-the-Decade [3] https://github.com/WiMLDS/berlin-2020-scikit-sprint/projects/1 [4] https://github.com/scikit-learn-fondation/ParisSprintJanuary2020/projects/1 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adrin.jalali at gmail.com Mon Jan 6 10:11:41 2020 From: adrin.jalali at gmail.com (Adrin) Date: Mon, 6 Jan 2020 16:11:41 +0100 Subject: [scikit-learn] Vote on SLEP010: n_features_in_ attribute In-Reply-To: <18c5d963-0b7a-45ad-bd6d-0c9146be58b3@Canary> References: <4600b19a-c06a-5ed5-0f14-dbf5a0a7cd5b@gmail.com> <18c5d963-0b7a-45ad-bd6d-0c9146be58b3@Canary> Message-ID: According to our governance model, this vote is now closed and accepted, and the implementation shall take the concerns mentioned here into account. Thanks everybody for the attention and the discussion. On Sat, Dec 21, 2019 at 6:36 PM Thomas J Fan wrote: > I am +1. I aggree with Joel that we should look into making these methods > (or maybe functions) usable by external developers. > > Thomas > > On Monday, Dec 16, 2019 at 4:20 PM, Alexandre Gramfort < > alexandre.gramfort at inria.fr> wrote: > +1 on SLEP + adding an estimator tag if it does not apply eg Text > vectorizers etc. > > Alex > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From siddharthgupta234 at gmail.com Tue Jan 7 05:25:01 2020 From: siddharthgupta234 at gmail.com (Siddharth Gupta) Date: Tue, 7 Jan 2020 15:55:01 +0530 Subject: [scikit-learn] Time for Roadmap for the coming years? Message-ID: The last roadmap for Scikit learn available on the official website was posted in 2018. With the onset of 2020s and Python 2.7 no longer receiving bug fixes or security support, I wish scikit-learn could come up with a fresh roadmap for the upcoming years. What are everyone's take and suggestions? Regards Siddharth Gupta, Website Linkedin | Twitter | Facebook -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Tue Jan 7 05:33:25 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 7 Jan 2020 11:33:25 +0100 Subject: [scikit-learn] Time for Roadmap for the coming years? In-Reply-To: References: Message-ID: Hi, Although that roadmap was written in 2018, we recently updated it and it still stands. Other than that, we also have an issue discussing the version 1.0 milestones: https://github.com/scikit-learn/scikit-learn/issues/14386 Thanks, Adrin. On Tue, Jan 7, 2020 at 11:26 AM Siddharth Gupta wrote: > The last roadmap for Scikit learn available on the official website > was posted in 2018. With > the onset of 2020s and Python 2.7 no longer receiving bug fixes or security > support, I wish scikit-learn could come up with a fresh roadmap for the > upcoming years. What are everyone's take and suggestions? > > Regards > > Siddharth Gupta, > Website > > Linkedin | Twitter > | Facebook > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Tue Jan 7 05:35:19 2020 From: niourf at gmail.com (Nicolas Hug) Date: Tue, 7 Jan 2020 05:35:19 -0500 Subject: [scikit-learn] Time for Roadmap for the coming years? 
In-Reply-To: References: Message-ID: <51c133df-a1c6-cb10-7663-a8ab76266dc4@gmail.com>
The roadmap was updated not so long ago (https://github.com/scikit-learn/scikit-learn/pull/15332)
On a related note, we recently discussed defining a roadmap for an eventual 1.0 release https://github.com/scikit-learn/scikit-learn/issues/14386

On 1/7/20 5:25 AM, Siddharth Gupta wrote:
> The last roadmap for Scikit learn available on the official website
> was posted in 2018.
> With the onset of 2020s and Python 2.7 no longer receiving bug fixes
> or security support, I wish scikit-learn could come up with a fresh
> roadmap for the upcoming years. What are everyone's take and suggestions?
>
> Regards
>
> Siddharth Gupta,
> Website
>
> Linkedin | Twitter
> | Facebook
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From joel.nothman at gmail.com Tue Jan 7 16:33:37 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 8 Jan 2020 08:33:37 +1100 Subject: [scikit-learn] Time for Roadmap for the coming years? In-Reply-To: <51c133df-a1c6-cb10-7663-a8ab76266dc4@gmail.com> References: <51c133df-a1c6-cb10-7663-a8ab76266dc4@gmail.com> Message-ID:
The roadmap includes a statement of purpose as at 2018. I don't think the core developers think the roadmap itself is very outdated. But thanks for the reminder. Joel
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From benoit.presles at u-bourgogne.fr Wed Jan 8 14:45:59 2020 From: benoit.presles at u-bourgogne.fr (Benoît Presles) Date: Wed, 8 Jan 2020 20:45:59 +0100 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr> Message-ID: <9c18b18c-3799-2da6-ec05-b9144aa2557a@u-bourgogne.fr>
Dear sklearn users,

I still have some issues concerning logistic regression. I did compare on the same data (simulated data) sklearn with three different solvers (lbfgs, saga, liblinear) and statsmodels.

When everything goes well, I get the same results between lbfgs, saga, liblinear and statsmodels. When everything goes wrong, all the results are different.

In fact, when everything goes wrong, statsmodels gives me a convergence warning (Warning: Maximum number of iterations has been exceeded. Current function value: inf Iterations: 20000) + an error (numpy.linalg.LinAlgError: Singular matrix).

Why sklearn does not tell me anything? How can I know that I have convergence issues with sklearn?

Thanks for your help,
Best regards,
Ben

--------------------------------------------

Here is the code I used to generate synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#
RANDOM_SEED = 2
#
X_sim, y_sim = make_classification(n_samples=200,
                                   n_features=20,
                                   n_informative=10,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class=1,
                                   random_state=RANDOM_SEED,
                                   shuffle=False)
#
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=RANDOM_SEED)
for train_index_split, test_index_split in sss.split(X_sim, y_sim):
    X_split_train, X_split_test = X_sim[train_index_split], X_sim[test_index_split]
    y_split_train, y_split_test = y_sim[train_index_split], y_sim[test_index_split]
    ss = StandardScaler()
    X_split_train = ss.fit_transform(X_split_train)
    X_split_test = ss.transform(X_split_test)
    #
    classifier_lbfgs = LogisticRegression(fit_intercept=True, max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9,
                                          solver='lbfgs', penalty='none', tol=1e-6)
    classifier_lbfgs.fit(X_split_train, y_split_train)
    print('classifier lbfgs iter:', classifier_lbfgs.n_iter_)
    print(classifier_lbfgs.intercept_)
    print(classifier_lbfgs.coef_)
    #
    classifier_saga = LogisticRegression(fit_intercept=True, max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9,
                                         solver='saga', penalty='none', tol=1e-6)
    classifier_saga.fit(X_split_train, y_split_train)
    print('classifier saga iter:', classifier_saga.n_iter_)
    print(classifier_saga.intercept_)
    print(classifier_saga.coef_)
    #
    classifier_liblinear = LogisticRegression(fit_intercept=True, max_iter=20000000, verbose=0, random_state=RANDOM_SEED,
                                              C=1e9,
                                              solver='liblinear', penalty='l2', tol=1e-6)
    classifier_liblinear.fit(X_split_train, y_split_train)
    print('classifier liblinear iter:', classifier_liblinear.n_iter_)
    print(classifier_liblinear.intercept_)
    print(classifier_liblinear.coef_)
    # statsmodels
    logit = sm.Logit(y_split_train, sm.tools.add_constant(X_split_train))
    logit_res = logit.fit(maxiter=20000)
    print("Coef statsmodels")
    print(logit_res.params)

On 11/10/2019 15:42, Andreas Mueller wrote:
> 
> 
> On 10/10/19 1:14 PM, Benoît Presles wrote:
>> 
>> Thanks for your answers.
>> 
>> On my real data, I do not have so many samples. I have a bit more
>> than 200 samples in total and I also would like to get some results
>> with unpenalized logisitic regression.
>> What do you suggest? Should I switch to the lbfgs solver?
> Yes.
>> Am I sure that with this solver I will not have any convergence issue
>> and always get the good result? Indeed, I did not get any convergence
>> warning with saga, so I thought everything was fine. I noticed some
>> issues only when I decided to test several solvers. Without comparing
>> the results across solvers, how to be sure that the optimisation goes
>> well? Shouldn't scikit-learn warn the user somehow if it is not the case?
> We should attempt to warn in the SAGA solver if it doesn't converge.
> That it doesn't raise a convergence warning should probably be
> considered a bug.
> It uses the maximum weight change as a stopping criterion right now.
> We could probably compute the dual objective once in the end to see if
> we converged, right? Or is that not possible with SAGA? If not, we
> might want to caution that no convergence warning will be raised.
> 
>> 
>> At last, I was using saga because I also wanted to do some feature
>> selection by using l1 penalty which is not supported by lbfgs...
> You can use liblinear then.
> 
> 
>> 
>> Best regards,
>> Ben
>> 
>> 
>> Le 09/10/2019 à
23:39, Guillaume Lema?tre a ?crit?: >>> Ups I did not see the answer of Roman. Sorry about that. It is >>> coming back to the same conclusion :) >>> >>> On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre >>> > wrote: >>> >>> Uhm actually increasing to 10000 samples solve the convergence >>> issue. >>> SAGA is not designed to work with a so small sample size most >>> probably. >>> >>> On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre >>> > wrote: >>> >>> I slightly change the bench such that it uses pipeline and >>> plotted the coefficient: >>> >>> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >>> >>> I only see one of the 10 splits where SAGA is not >>> converging, otherwise the coefficients >>> look very close (I don't attach the figure here but they can >>> be plotted using the snippet). >>> So apart from this second split, the other differences seems >>> to be numerical instability. >>> >>> Where I have some concern is regarding the convergence rate >>> of SAGA but I have no >>> intuition to know if this is normal or not. >>> >>> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak >>> > wrote: >>> >>> Ben, >>> >>> I can confirm your results with penalty='none' and >>> C=1e9. In both cases, >>> you are running a mostly unpenalized logisitic >>> regression. Usually >>> that's less numerically stable than with a small >>> regularization, >>> depending on the data collinearity. >>> >>> Running that same code with >>> ? - larger penalty ( smaller C values) >>> ? - or larger number of samples >>> ? yields for me the same coefficients (up to some >>> tolerance). >>> >>> You can also see that SAGA convergence is not good by >>> the fact that it >>> needs 196000 epochs/iterations to converge. >>> >>> Actually, I have often seen convergence issues with SAG >>> on small >>> datasets (in unit tests), not fully sure why. >>> >>> -- >>> Roman >>> >>> On 09/10/2019 22:10, serafim loukas wrote: >>> > The predictions across solver are exactly the same >>> when I run the code. >>> > I am using 0.21.3 version. What is yours? 
>>> > >>> > >>> > In [13]: import sklearn >>> > >>> > In [14]: sklearn.__version__ >>> > Out[14]: '0.21.3' >>> > >>> > >>> > Serafeim >>> > >>> > >>> > >>> >> On 9 Oct 2019, at 21:44, Beno?t Presles >>> >> >>> >> >> >> wrote: >>> >> >>> >> (y_pred_lbfgs==y_pred_saga).all() == False >>> > >>> > >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> > >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> Scikit-learn @ Inria Foundation >>> https://glemaitre.github.io/ >>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> Scikit-learn @ Inria Foundation >>> https://glemaitre.github.io/ >>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> Scikit-learn @ Inria Foundation >>> https://glemaitre.github.io/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed Jan 8 15:18:27 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 8 Jan 2020 21:18:27 +0100 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: <9c18b18c-3799-2da6-ec05-b9144aa2557a@u-bourgogne.fr> References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr> <9c18b18c-3799-2da6-ec05-b9144aa2557a@u-bourgogne.fr> Message-ID: We issue convergence warning. Can you check n_iter to be sure that you did not convergence to the stated convergence? On Wed, 8 Jan 2020 at 20:53, Beno?t Presles wrote: > Dear sklearn users, > > I still have some issues concerning logistic regression. > I did compare on the same data (simulated data) sklearn with three > different solvers (lbfgs, saga, liblinear) and statsmodels. > > When everything goes well, I get the same results between lbfgs, saga, > liblinear and statsmodels. When everything goes wrong, all the results are > different. > > In fact, when everything goes wrong, statsmodels gives me a convergence > warning (Warning: Maximum number of iterations has been exceeded. Current > function value: inf Iterations: 20000) + an error > (numpy.linalg.LinAlgError: Singular matrix). > > Why sklearn does not tell me anything? How can I know that I have > convergence issues with sklearn? 
> > > Thanks for your help, > Best regards, > Ben > > -------------------------------------------- > > Here is the code I used to generate synthetic data: > > from sklearn.datasets import make_classification > from sklearn.model_selection import StratifiedShuffleSplit > from sklearn.preprocessing import StandardScaler > from sklearn.linear_model import LogisticRegression > import statsmodels.api as sm > # > RANDOM_SEED = 2 > # > X_sim, y_sim = make_classification(n_samples=200, > n_features=20, > n_informative=10, > n_redundant=0, > n_repeated=0, > n_classes=2, > n_clusters_per_class=1, > random_state=RANDOM_SEED, > shuffle=False) > # > sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, > random_state=RANDOM_SEED) > for train_index_split, test_index_split in sss.split(X_sim, y_sim): > X_split_train, X_split_test = X_sim[train_index_split], > X_sim[test_index_split] > y_split_train, y_split_test = y_sim[train_index_split], > y_sim[test_index_split] > ss = StandardScaler() > X_split_train = ss.fit_transform(X_split_train) > X_split_test = ss.transform(X_split_test) > # > classifier_lbfgs = LogisticRegression(fit_intercept=True, > max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, > solver='lbfgs', penalty='none', > tol=1e-6) > classifier_lbfgs.fit(X_split_train, y_split_train) > print('classifier lbfgs iter:', classifier_lbfgs.n_iter_) > print(classifier_lbfgs.intercept_) > print(classifier_lbfgs.coef_) > # > classifier_saga = LogisticRegression(fit_intercept=True, > max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, > solver='saga', penalty='none', > tol=1e-6) > classifier_saga.fit(X_split_train, y_split_train) > print('classifier saga iter:', classifier_saga.n_iter_) > print(classifier_saga.intercept_) > print(classifier_saga.coef_) > # > classifier_liblinear = LogisticRegression(fit_intercept=True, > max_iter=20000000, verbose=0, random_state=RANDOM_SEED, > C=1e9, > solver='liblinear', penalty='l2', > tol=1e-6) > classifier_liblinear.fit(X_split_train, y_split_train) > print('classifier liblinear iter:', classifier_liblinear.n_iter_) > print(classifier_liblinear.intercept_) > print(classifier_liblinear.coef_) > # statsmodels > logit = sm.Logit(y_split_train, sm.tools.add_constant(X_split_train)) > logit_res = logit.fit(maxiter=20000) > print("Coef statsmodels") > print(logit_res.params) > > > > On 11/10/2019 15:42, Andreas Mueller wrote: > > > > On 10/10/19 1:14 PM, Beno?t Presles wrote: > > Thanks for your answers. > On my real data, I do not have so many samples. I have a bit more than 200 > samples in total and I also would like to get some results with unpenalized > logisitic regression. > What do you suggest? Should I switch to the lbfgs solver? > > Yes. > > Am I sure that with this solver I will not have any convergence issue and > always get the good result? Indeed, I did not get any convergence warning > with saga, so I thought everything was fine. I noticed some issues only > when I decided to test several solvers. Without comparing the results > across solvers, how to be sure that the optimisation goes well? Shouldn't > scikit-learn warn the user somehow if it is not the case? > > We should attempt to warn in the SAGA solver if it doesn't converge. That > it doesn't raise a convergence warning should probably be considered a bug. > It uses the maximum weight change as a stopping criterion right now. > We could probably compute the dual objective once in the end to see if we > converged, right? Or is that not possible with SAGA? 
If not, we might want > to caution that no convergence warning will be raised. > > > At last, I was using saga because I also wanted to do some feature > selection by using l1 penalty which is not supported by lbfgs... > > You can use liblinear then. > > > > Best regards, > Ben > > > Le 09/10/2019 ? 23:39, Guillaume Lema?tre a ?crit : > > Ups I did not see the answer of Roman. Sorry about that. It is coming back > to the same conclusion :) > > On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre > wrote: > >> Uhm actually increasing to 10000 samples solve the convergence issue. >> SAGA is not designed to work with a so small sample size most probably. >> >> On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre >> wrote: >> >>> I slightly change the bench such that it uses pipeline and plotted the >>> coefficient: >>> >>> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >>> >>> I only see one of the 10 splits where SAGA is not converging, otherwise >>> the coefficients >>> look very close (I don't attach the figure here but they can be plotted >>> using the snippet). >>> So apart from this second split, the other differences seems to be >>> numerical instability. >>> >>> Where I have some concern is regarding the convergence rate of SAGA but >>> I have no >>> intuition to know if this is normal or not. >>> >>> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak >>> wrote: >>> >>>> Ben, >>>> >>>> I can confirm your results with penalty='none' and C=1e9. In both >>>> cases, >>>> you are running a mostly unpenalized logisitic regression. Usually >>>> that's less numerically stable than with a small regularization, >>>> depending on the data collinearity. >>>> >>>> Running that same code with >>>> - larger penalty ( smaller C values) >>>> - or larger number of samples >>>> yields for me the same coefficients (up to some tolerance). >>>> >>>> You can also see that SAGA convergence is not good by the fact that it >>>> needs 196000 epochs/iterations to converge. >>>> >>>> Actually, I have often seen convergence issues with SAG on small >>>> datasets (in unit tests), not fully sure why. >>>> >>>> -- >>>> Roman >>>> >>>> On 09/10/2019 22:10, serafim loukas wrote: >>>> > The predictions across solver are exactly the same when I run the >>>> code. >>>> > I am using 0.21.3 version. What is yours? 
>>>> > >>>> > >>>> > In [13]: import sklearn >>>> > >>>> > In [14]: sklearn.__version__ >>>> > Out[14]: '0.21.3' >>>> > >>>> > >>>> > Serafeim >>>> > >>>> > >>>> > >>>> >> On 9 Oct 2019, at 21:44, Beno?t Presles < >>>> benoit.presles at u-bourgogne.fr >>>> >> > wrote: >>>> >> >>>> >> (y_pred_lbfgs==y_pred_saga).all() == False >>>> > >>>> > >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> Scikit-learn @ Inria Foundation >>> https://glemaitre.github.io/ >>> >> >> >> -- >> Guillaume Lemaitre >> Scikit-learn @ Inria Foundation >> https://glemaitre.github.io/ >> > > > -- > Guillaume Lemaitre > Scikit-learn @ Inria Foundation > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Wed Jan 8 15:31:47 2020 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Wed, 8 Jan 2020 21:31:47 +0100 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr> <9c18b18c-3799-2da6-ec05-b9144aa2557a@u-bourgogne.fr> Message-ID: With lbfgs n_iter_ = 48, with saga n_iter_ = 326581, with liblinear n_iter_ = 64. On 08/01/2020 21:18, Guillaume Lema?tre wrote: > We issue convergence warning. Can you check n_iter to be sure that you > did not convergence to the stated convergence? > > On Wed, 8 Jan 2020 at 20:53, Beno?t Presles > > > wrote: > > Dear sklearn users, > > I still have some issues concerning logistic regression. > I did compare on the same data (simulated data) sklearn with three > different solvers (lbfgs, saga, liblinear) and statsmodels. > > When everything goes well, I get the same results between lbfgs, > saga, liblinear and statsmodels. When everything goes wrong, all > the results are different. > > In fact, when everything goes wrong, statsmodels gives me a > convergence warning (Warning: Maximum number of iterations has > been exceeded. Current function value: inf Iterations: 20000) + an > error (numpy.linalg.LinAlgError: Singular matrix). > > Why sklearn does not tell me anything? How can I know that I have > convergence issues with sklearn? 
> > > Thanks for your help, > Best regards, > Ben > > -------------------------------------------- > > Here is the code I used to generate synthetic data: > > from sklearn.datasets import make_classification > from sklearn.model_selection import StratifiedShuffleSplit > from sklearn.preprocessing import StandardScaler > from sklearn.linear_model import LogisticRegression > import statsmodels.api as sm > # > RANDOM_SEED = 2 > # > X_sim, y_sim = make_classification(n_samples=200, > ?????????????????????????? n_features=20, > ?????????????????????????? n_informative=10, > ?????????????????????????? n_redundant=0, > ?????????????????????????? n_repeated=0, > ?????????????????????????? n_classes=2, > ?????????????????????????? n_clusters_per_class=1, > ?????????????????????????? random_state=RANDOM_SEED, > ?????????????????????????? shuffle=False) > # > sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, > random_state=RANDOM_SEED) > for train_index_split, test_index_split in sss.split(X_sim, y_sim): > ??? X_split_train, X_split_test = X_sim[train_index_split], > X_sim[test_index_split] > ??? y_split_train, y_split_test = y_sim[train_index_split], > y_sim[test_index_split] > ??? ss = StandardScaler() > ??? X_split_train = ss.fit_transform(X_split_train) > ??? X_split_test = ss.transform(X_split_test) > ??? # > ??? classifier_lbfgs = LogisticRegression(fit_intercept=True, > max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, > ??????????????????????????????????? solver='lbfgs', > penalty='none', tol=1e-6) > ??? classifier_lbfgs.fit(X_split_train, y_split_train) > ??? print('classifier lbfgs iter:', classifier_lbfgs.n_iter_) > ??? print(classifier_lbfgs.intercept_) > ??? print(classifier_lbfgs.coef_) > ??? # > ??? classifier_saga = LogisticRegression(fit_intercept=True, > max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, > ??????????????????????????????????? solver='saga', penalty='none', > tol=1e-6) > ??? classifier_saga.fit(X_split_train, y_split_train) > ??? print('classifier saga iter:', classifier_saga.n_iter_) > ??? print(classifier_saga.intercept_) > ??? print(classifier_saga.coef_) > ??? # > ??? classifier_liblinear = LogisticRegression(fit_intercept=True, > max_iter=20000000, verbose=0, random_state=RANDOM_SEED, > ???????????????????????????????????????? C=1e9, > solver='liblinear', penalty='l2', tol=1e-6) > ??? classifier_liblinear.fit(X_split_train, y_split_train) > ??? print('classifier liblinear iter:', classifier_liblinear.n_iter_) > ??? print(classifier_liblinear.intercept_) > ??? print(classifier_liblinear.coef_) > ??? # statsmodels > ??? logit = sm.Logit(y_split_train, > sm.tools.add_constant(X_split_train)) > ??? logit_res = logit.fit(maxiter=20000) > ??? print("Coef statsmodels") > ??? print(logit_res.params) > > > > On 11/10/2019 15:42, Andreas Mueller wrote: >> >> >> On 10/10/19 1:14 PM, Beno?t Presles wrote: >>> >>> Thanks for your answers. >>> >>> On my real data, I do not have so many samples. I have a bit >>> more than 200 samples in total and I also would like to get some >>> results with unpenalized logisitic regression. >>> What do you suggest? Should I switch to the lbfgs solver? >> Yes. >>> Am I sure that with this solver I will not have any convergence >>> issue and always get the good result? Indeed, I did not get any >>> convergence warning with saga, so I thought everything was fine. >>> I noticed some issues only when I decided to test several >>> solvers. 
Without comparing the results across solvers, how to be >>> sure that the optimisation goes well? Shouldn't scikit-learn >>> warn the user somehow if it is not the case? >> We should attempt to warn in the SAGA solver if it doesn't >> converge. That it doesn't raise a convergence warning should >> probably be considered a bug. >> It uses the maximum weight change as a stopping criterion right now. >> We could probably compute the dual objective once in the end to >> see if we converged, right? Or is that not possible with SAGA? If >> not, we might want to caution that no convergence warning will be >> raised. >> >>> >>> At last, I was using saga because I also wanted to do some >>> feature selection by using l1 penalty which is not supported by >>> lbfgs... >> You can use liblinear then. >> >> >>> >>> Best regards, >>> Ben >>> >>> >>> Le 09/10/2019 ? 23:39, Guillaume Lema?tre a ?crit?: >>>> Ups I did not see the answer of Roman. Sorry about that. It is >>>> coming back to the same conclusion :) >>>> >>>> On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre >>>> > wrote: >>>> >>>> Uhm actually increasing to 10000 samples solve the >>>> convergence issue. >>>> SAGA is not designed to work with a so small sample size >>>> most probably. >>>> >>>> On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre >>>> > wrote: >>>> >>>> I slightly change the bench such that it uses pipeline >>>> and plotted the coefficient: >>>> >>>> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >>>> >>>> I only see one of the 10 splits where SAGA is not >>>> converging, otherwise the coefficients >>>> look very close (I don't attach the figure here but >>>> they can be plotted using the snippet). >>>> So apart from this second split, the other differences >>>> seems to be numerical instability. >>>> >>>> Where I have some concern is regarding the convergence >>>> rate of SAGA but I have no >>>> intuition to know if this is normal or not. >>>> >>>> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak >>>> > >>>> wrote: >>>> >>>> Ben, >>>> >>>> I can confirm your results with penalty='none' and >>>> C=1e9. In both cases, >>>> you are running a mostly unpenalized logisitic >>>> regression. Usually >>>> that's less numerically stable than with a small >>>> regularization, >>>> depending on the data collinearity. >>>> >>>> Running that same code with >>>> ? - larger penalty ( smaller C values) >>>> ? - or larger number of samples >>>> ? yields for me the same coefficients (up to some >>>> tolerance). >>>> >>>> You can also see that SAGA convergence is not good >>>> by the fact that it >>>> needs 196000 epochs/iterations to converge. >>>> >>>> Actually, I have often seen convergence issues with >>>> SAG on small >>>> datasets (in unit tests), not fully sure why. >>>> >>>> -- >>>> Roman >>>> >>>> On 09/10/2019 22:10, serafim loukas wrote: >>>> > The predictions across solver are exactly the >>>> same when I run the code. >>>> > I am using 0.21.3 version. What is yours? 
>>>> > >>>> > >>>> > In [13]: import sklearn >>>> > >>>> > In [14]: sklearn.__version__ >>>> > Out[14]: '0.21.3' >>>> > >>>> > >>>> > Serafeim >>>> > >>>> > >>>> > >>>> >> On 9 Oct 2019, at 21:44, Beno?t Presles >>>> >>> >>>> >> >>> >> wrote: >>>> >> >>>> >> (y_pred_lbfgs==y_pred_saga).all() == False >>>> > >>>> > >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> Scikit-learn @ Inria Foundation >>>> https://glemaitre.github.io/ >>>> >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> Scikit-learn @ Inria Foundation >>>> https://glemaitre.github.io/ >>>> >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> Scikit-learn @ Inria Foundation >>>> https://glemaitre.github.io/ >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Guillaume Lemaitre > Scikit-learn @ Inria Foundation > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Jan 8 15:53:47 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 8 Jan 2020 15:53:47 -0500 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr> <9c18b18c-3799-2da6-ec05-b9144aa2557a@u-bourgogne.fr> Message-ID: <63a99190-1a1e-059d-ba16-a8fcc25a1dc5@gmail.com> Hi Ben. Liblinear and l-bfgs might both converge but to different solutions, given that the intercept is penalized. There is also problems with ill-conditioned problems that are hard to detect. My impression of SAGA was that the convergence checks are too loose and we should improve them. Have you checked the objective of the l-bfgs and liblinear solvers? With ill-conditioned data the objectives could be similar with different solutions. It's not intended for scikit-learn to warn about ill-conditioned problems, I think, only convergence issues. Hth, Andy On 1/8/20 3:31 PM, Beno?t Presles wrote: > With lbfgs n_iter_ = 48, with saga n_iter_ = 326581, with liblinear > n_iter_ = 64. > > > On 08/01/2020 21:18, Guillaume Lema?tre wrote: >> We issue convergence warning. Can you check n_iter to be sure that >> you did not convergence to the stated convergence? 
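A minimal way to do both checks suggested above — whether a solver stopped because it hit max_iter, and how the lbfgs and liblinear objectives compare — might look like the following sketch. It assumes the fitted estimators and the training split from the snippet earlier in this thread (classifier_lbfgs, classifier_liblinear, X_split_train, y_split_train) are in scope, and it uses the mean log-loss as a stand-in for the objective since the penalty is effectively switched off.

from sklearn.metrics import log_loss

for name, clf in [("lbfgs", classifier_lbfgs), ("liblinear", classifier_liblinear)]:
    n_iter = clf.n_iter_.max()            # n_iter_ is an array, one entry per class/fold
    hit_cap = n_iter >= clf.max_iter      # True would mean the solver ran out of iterations
    obj = log_loss(y_split_train, clf.predict_proba(X_split_train))  # ~ unpenalized objective
    print(f"{name}: n_iter={n_iter}, hit max_iter={hit_cap}, train log-loss={obj:.6f}")

If both fits report similar log-loss but different coefficients, that points to an ill-conditioned problem rather than a failure of either solver.
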
>> >> On Wed, 8 Jan 2020 at 20:53, Beno?t Presles >> > > wrote: >> >> Dear sklearn users, >> >> I still have some issues concerning logistic regression. >> I did compare on the same data (simulated data) sklearn with >> three different solvers (lbfgs, saga, liblinear) and statsmodels. >> >> When everything goes well, I get the same results between lbfgs, >> saga, liblinear and statsmodels. When everything goes wrong, all >> the results are different. >> >> In fact, when everything goes wrong, statsmodels gives me a >> convergence warning (Warning: Maximum number of iterations has >> been exceeded. Current function value: inf Iterations: 20000) + >> an error (numpy.linalg.LinAlgError: Singular matrix). >> >> Why sklearn does not tell me anything? How can I know that I have >> convergence issues with sklearn? >> >> >> Thanks for your help, >> Best regards, >> Ben >> >> -------------------------------------------- >> >> Here is the code I used to generate synthetic data: >> >> from sklearn.datasets import make_classification >> from sklearn.model_selection import StratifiedShuffleSplit >> from sklearn.preprocessing import StandardScaler >> from sklearn.linear_model import LogisticRegression >> import statsmodels.api as sm >> # >> RANDOM_SEED = 2 >> # >> X_sim, y_sim = make_classification(n_samples=200, >> ?????????????????????????? n_features=20, >> ?????????????????????????? n_informative=10, >> ?????????????????????????? n_redundant=0, >> ?????????????????????????? n_repeated=0, >> ?????????????????????????? n_classes=2, >> ?????????????????????????? n_clusters_per_class=1, >> ?????????????????????????? random_state=RANDOM_SEED, >> ?????????????????????????? shuffle=False) >> # >> sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, >> random_state=RANDOM_SEED) >> for train_index_split, test_index_split in sss.split(X_sim, y_sim): >> ??? X_split_train, X_split_test = X_sim[train_index_split], >> X_sim[test_index_split] >> ??? y_split_train, y_split_test = y_sim[train_index_split], >> y_sim[test_index_split] >> ??? ss = StandardScaler() >> ??? X_split_train = ss.fit_transform(X_split_train) >> ??? X_split_test = ss.transform(X_split_test) >> ??? # >> ??? classifier_lbfgs = LogisticRegression(fit_intercept=True, >> max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, >> ??????????????????????????????????? solver='lbfgs', >> penalty='none', tol=1e-6) >> ??? classifier_lbfgs.fit(X_split_train, y_split_train) >> ??? print('classifier lbfgs iter:', classifier_lbfgs.n_iter_) >> ??? print(classifier_lbfgs.intercept_) >> ??? print(classifier_lbfgs.coef_) >> ??? # >> ??? classifier_saga = LogisticRegression(fit_intercept=True, >> max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, >> ??????????????????????????????????? solver='saga', >> penalty='none', tol=1e-6) >> ??? classifier_saga.fit(X_split_train, y_split_train) >> ??? print('classifier saga iter:', classifier_saga.n_iter_) >> ??? print(classifier_saga.intercept_) >> ??? print(classifier_saga.coef_) >> ??? # >> ??? classifier_liblinear = LogisticRegression(fit_intercept=True, >> max_iter=20000000, verbose=0, random_state=RANDOM_SEED, >> ???????????????????????????????????????? C=1e9, >> solver='liblinear', penalty='l2', tol=1e-6) >> ??? classifier_liblinear.fit(X_split_train, y_split_train) >> ??? print('classifier liblinear iter:', classifier_liblinear.n_iter_) >> ??? print(classifier_liblinear.intercept_) >> ??? print(classifier_liblinear.coef_) >> ??? # statsmodels >> ??? 
logit = sm.Logit(y_split_train, >> sm.tools.add_constant(X_split_train)) >> ??? logit_res = logit.fit(maxiter=20000) >> ??? print("Coef statsmodels") >> ??? print(logit_res.params) >> >> >> >> On 11/10/2019 15:42, Andreas Mueller wrote: >>> >>> >>> On 10/10/19 1:14 PM, Beno?t Presles wrote: >>>> >>>> Thanks for your answers. >>>> >>>> On my real data, I do not have so many samples. I have a bit >>>> more than 200 samples in total and I also would like to get >>>> some results with unpenalized logisitic regression. >>>> What do you suggest? Should I switch to the lbfgs solver? >>> Yes. >>>> Am I sure that with this solver I will not have any convergence >>>> issue and always get the good result? Indeed, I did not get any >>>> convergence warning with saga, so I thought everything was >>>> fine. I noticed some issues only when I decided to test several >>>> solvers. Without comparing the results across solvers, how to >>>> be sure that the optimisation goes well? Shouldn't scikit-learn >>>> warn the user somehow if it is not the case? >>> We should attempt to warn in the SAGA solver if it doesn't >>> converge. That it doesn't raise a convergence warning should >>> probably be considered a bug. >>> It uses the maximum weight change as a stopping criterion right now. >>> We could probably compute the dual objective once in the end to >>> see if we converged, right? Or is that not possible with SAGA? >>> If not, we might want to caution that no convergence warning >>> will be raised. >>> >>>> >>>> At last, I was using saga because I also wanted to do some >>>> feature selection by using l1 penalty which is not supported by >>>> lbfgs... >>> You can use liblinear then. >>> >>> >>>> >>>> Best regards, >>>> Ben >>>> >>>> >>>> Le 09/10/2019 ? 23:39, Guillaume Lema?tre a ?crit?: >>>>> Ups I did not see the answer of Roman. Sorry about that. It is >>>>> coming back to the same conclusion :) >>>>> >>>>> On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre >>>>> > wrote: >>>>> >>>>> Uhm actually increasing to 10000 samples solve the >>>>> convergence issue. >>>>> SAGA is not designed to work with a so small sample size >>>>> most probably. >>>>> >>>>> On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre >>>>> > >>>>> wrote: >>>>> >>>>> I slightly change the bench such that it uses pipeline >>>>> and plotted the coefficient: >>>>> >>>>> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >>>>> >>>>> I only see one of the 10 splits where SAGA is not >>>>> converging, otherwise the coefficients >>>>> look very close (I don't attach the figure here but >>>>> they can be plotted using the snippet). >>>>> So apart from this second split, the other differences >>>>> seems to be numerical instability. >>>>> >>>>> Where I have some concern is regarding the convergence >>>>> rate of SAGA but I have no >>>>> intuition to know if this is normal or not. >>>>> >>>>> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak >>>>> > >>>>> wrote: >>>>> >>>>> Ben, >>>>> >>>>> I can confirm your results with penalty='none' and >>>>> C=1e9. In both cases, >>>>> you are running a mostly unpenalized logisitic >>>>> regression. Usually >>>>> that's less numerically stable than with a small >>>>> regularization, >>>>> depending on the data collinearity. >>>>> >>>>> Running that same code with >>>>> ? - larger penalty ( smaller C values) >>>>> ? - or larger number of samples >>>>> ? yields for me the same coefficients (up to some >>>>> tolerance). 
>>>>> >>>>> You can also see that SAGA convergence is not good >>>>> by the fact that it >>>>> needs 196000 epochs/iterations to converge. >>>>> >>>>> Actually, I have often seen convergence issues >>>>> with SAG on small >>>>> datasets (in unit tests), not fully sure why. >>>>> >>>>> -- >>>>> Roman >>>>> >>>>> On 09/10/2019 22:10, serafim loukas wrote: >>>>> > The predictions across solver are exactly the >>>>> same when I run the code. >>>>> > I am using 0.21.3 version. What is yours? >>>>> > >>>>> > >>>>> > In [13]: import sklearn >>>>> > >>>>> > In [14]: sklearn.__version__ >>>>> > Out[14]: '0.21.3' >>>>> > >>>>> > >>>>> > Serafeim >>>>> > >>>>> > >>>>> > >>>>> >> On 9 Oct 2019, at 21:44, Beno?t Presles >>>>> >>>> >>>>> >> >>>> >> wrote: >>>>> >> >>>>> >> (y_pred_lbfgs==y_pred_saga).all() == False >>>>> > >>>>> > >>>>> > _______________________________________________ >>>>> > scikit-learn mailing list >>>>> > scikit-learn at python.org >>>>> >>>>> > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>>> >>>>> -- >>>>> Guillaume Lemaitre >>>>> Scikit-learn @ Inria Foundation >>>>> https://glemaitre.github.io/ >>>>> >>>>> >>>>> >>>>> -- >>>>> Guillaume Lemaitre >>>>> Scikit-learn @ Inria Foundation >>>>> https://glemaitre.github.io/ >>>>> >>>>> >>>>> >>>>> -- >>>>> Guillaume Lemaitre >>>>> Scikit-learn @ Inria Foundation >>>>> https://glemaitre.github.io/ >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> -- >> Guillaume Lemaitre >> Scikit-learn @ Inria Foundation >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Jan 8 21:22:38 2020 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 9 Jan 2020 10:22:38 +0800 Subject: [scikit-learn] Why ridge regression can solve multicollinearity? Message-ID: I find out many blogs said that the l2 regularization solve multicollinearity, but they don't said how it works. I thought LASSO is able to select features by l1 regularization, maybe it also can solve this. Can anyone tell me how ridge works with multicollinearity great? thx -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From stuart at stuartreynolds.net Wed Jan 8 21:31:23 2020 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Wed, 8 Jan 2020 18:31:23 -0800 Subject: [scikit-learn] Why ridge regression can solve multicollinearity? In-Reply-To: References: Message-ID:
Correlated features typically have the property that they are tending to be similarly predictive of the outcome.

L1 and L2 are both a preference for low coefficients.
If a coefficient can be reduced yet another coefficient maintains similar loss, then these regularization methods prefer this solution.
If you use L1 or L2, you should mean and variance normalize your features.

On Wed, Jan 8, 2020 at 6:24 PM lampahome wrote:
> I find out many blogs said that the l2 regularization solve
> multicollinearity, but they don't said how it works.
>
> I thought LASSO is able to select features by l1 regularization, maybe it
> also can solve this.
>
> Can anyone tell me how ridge works with multicollinearity great?
>
> thx
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From pahome.chen at mirlab.org Wed Jan 8 21:38:02 2020 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 9 Jan 2020 10:38:02 +0800 Subject: [scikit-learn] Why ridge regression can solve multicollinearity? In-Reply-To: References: Message-ID:
Stuart Reynolds wrote on Thu, Jan 9, 2020 at 10:33 AM:

> Correlated features typically have the property that they are tending to
> be similarly predictive of the outcome.
>
> L1 and L2 are both a preference for low coefficients.
> If a coefficient can be reduced yet another coefficient maintains similar
> loss, then these regularization methods prefer this solution.
> If you use L1 or L2, you should mean and variance normalize your features.
>
> 
You mean LASSO and RIDGE both solve multicollinearity?
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From josef.pktd at gmail.com Wed Jan 8 21:43:54 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 8 Jan 2020 21:43:54 -0500 Subject: [scikit-learn] Why ridge regression can solve multicollinearity? In-Reply-To: References: Message-ID:
On Wed, Jan 8, 2020 at 9:38 PM lampahome wrote:

>
>
> Stuart Reynolds wrote on Thu, Jan 9, 2020 at 10:33 AM:
>
>> Correlated features typically have the property that they are tending to
>> be similarly predictive of the outcome.
>>
>> L1 and L2 are both a preference for low coefficients.
>> If a coefficient can be reduced yet another coefficient maintains similar
>> loss, then these regularization methods prefer this solution.
>> If you use L1 or L2, you should mean and variance normalize your features.
>>
>>
> You mean LASSO and RIDGE both solve multicollinearity?
>
LASSO has the reputation not to be good when there is multicollinearity, that's why elastic net L1 + L2 was introduced, AFAIK

With multicollinearity the length of the parameter vector, beta' beta, is too large and L2, Ridge shrinks it.

Josef

>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From josef.pktd at gmail.com Wed Jan 8 21:47:01 2020 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 8 Jan 2020 21:47:01 -0500 Subject: [scikit-learn] Why ridge regression can solve multicollinearity? In-Reply-To: References: Message-ID:
On Wed, Jan 8, 2020 at 9:43 PM wrote:

>
>
> On Wed, Jan 8, 2020 at 9:38 PM lampahome wrote:
>
>>
>>
>> Stuart Reynolds wrote on Thu, Jan 9, 2020 at 10:33 AM:
>>
>>> Correlated features typically have the property that they are tending to
>>> be similarly predictive of the outcome.
>>>
>>> L1 and L2 are both a preference for low coefficients.
>>> If a coefficient can be reduced yet another coefficient maintains
>>> similar loss, then these regularization methods prefer this solution.
>>> If you use L1 or L2, you should mean and variance normalize your
>>> features.
>>>
>>>
>> You mean LASSO and RIDGE both solve multicollinearity?
>>
>
> LASSO has the reputation not to be good when there is multicollinearity,
> that's why elastic net L1 + L2 was introduced, AFAIK
>
> With multicollinearity the length of the parameter vector, beta' beta, is
> too large and L2, Ridge shrinks it.
>
e.g. Marquardt, Donald W., and Ronald D. Snee. "Ridge regression in practice." *The American Statistician* 29, no. 1 (1975): 3-20.

I just went through it last week because of an argument about variance inflation factor in Ridge

>
> Josef
>
>
>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
-------------- next part -------------- An HTML attachment was scrubbed...
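To make the shrinkage argument concrete, a small synthetic sketch (made-up data, not from this thread): with two nearly identical columns, ordinary least squares typically returns a huge pair of opposite-signed coefficients, while Ridge pulls both back towards a small shared value.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = rng.randn(200)
X = np.column_stack([x, x + 1e-6 * rng.randn(200)])   # two almost perfectly collinear features
y = x + 0.1 * rng.randn(200)

print("OLS:  ", LinearRegression().fit(X, y).coef_)   # typically a huge +/- coefficient pair
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)     # both coefficients near 0.5
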
Thanks and regards Aditya Aggarwal -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Thu Jan 9 04:09:22 2020 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 9 Jan 2020 10:09:22 +0100 Subject: [scikit-learn] Changes in the classifier In-Reply-To: References: Message-ID: Outside GIL, you can't really work easily with Python objects. You should instead stick to local C variables and routines. For instance, instead of numpy routines, you can use cmath routines. The cython book ( https://www.amazon.com/Cython-Programmers-Kurt-W-Smith/dp/1491901551) and Nicolas's post (http://nicolas-hug.com/blog/cython_notes) may give you some hints. On Thu, Jan 9, 2020 at 7:24 AM aditya aggarwal < adityaselfefficient at gmail.com> wrote: > Hello > > I'm trying to change the entropy function which is used in sklearn for > DecisionTreeClassification locally on my system. > when I rerun the pip install --editable . command after updating the > cython file, I receive the following error message: > > Error compiling Cython file: > ------------------------------------------------------------ > ... > for k in range(self.n_outputs): > for c in range(n_classes[k]): > count_k = sum_total[c] > if count_k > 0.0: > count_k /= self.weighted_n_node_samples > entropy -= count_k * np.log2(count_k) > ^ > ------------------------------------------------------------ > > sklearn/tree/_criterion.pyx:537:20: Coercion from Python not allowed > without the GIL > > This error is persisten with other errors as: > > Operation not allowed without gil > Converting to Python object not allowed without gil > Converting to Python object not allowed without gil > Calling gil-requiring function not allowed without gil > Accessing Python attribute not allowed without gil > Accessing Python global or builtin not allowed without gil > > > I've tried looking up for solution on various sites, but could not resolve > the issue. > Any help would be appreciated. > > Thanks and regards > Aditya Aggarwal > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Thu Jan 9 09:22:37 2020 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Thu, 9 Jan 2020 15:22:37 +0100 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: <63a99190-1a1e-059d-ba16-a8fcc25a1dc5@gmail.com> References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr> <9c18b18c-3799-2da6-ec05-b9144aa2557a@u-bourgogne.fr> <63a99190-1a1e-059d-ba16-a8fcc25a1dc5@gmail.com> Message-ID: <10a6a7c1-4bf5-6e1e-d352-293bd1c9ea1d@u-bourgogne.fr> Hi Andy, As you can notice in the code, I fixed C=1e9, so the intercept with liblinear is not penalised and therefore I get the same solutions with these solvers when everything goes well. How can I check the objective of the l-bfgs and liblinear solvers with sklearn? Best regards, Ben On 08/01/2020 21:53, Andreas Mueller wrote: > Hi Ben. > > Liblinear and l-bfgs might both converge but to different solutions, > given that the intercept is penalized. > There is also problems with ill-conditioned problems that are hard to > detect. 
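One way to compare what the solvers actually reached, as asked above, is to evaluate the L2-penalized primal objective documented for binary LogisticRegression, 0.5 * w'w + C * sum(log(1 + exp(-y_i * (x_i'w + b)))), from each fitted model. The helper below is only an illustrative sketch (it is not a scikit-learn API), it assumes a binary problem, and it ignores the fact that liblinear additionally penalizes the intercept (negligible at C=1e9).

import numpy as np

def primal_objective(clf, X, y, C):
    # Illustrative helper, not part of scikit-learn: L2-penalized logistic loss
    # 0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * z_i))), binary problems only.
    w = clf.coef_.ravel()
    z = X @ w + clf.intercept_[0]
    y_pm = np.where(y == clf.classes_[1], 1.0, -1.0)   # map labels to {-1, +1}
    data_term = np.logaddexp(0.0, -y_pm * z).sum()     # stable log(1 + exp(.))
    return 0.5 * w @ w + C * data_term

With penalty='none' or C=1e9 the data term dominates, so evaluating this on the same training split for classifier_lbfgs, classifier_saga and classifier_liblinear shows which solver actually ended up at the lower objective.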
> My impression of SAGA was that the convergence checks are too loose > and we should improve them. > Have you checked the objective of the l-bfgs and liblinear solvers? > With ill-conditioned data the objectives could be similar with > different solutions. > > It's not intended for scikit-learn to warn about ill-conditioned > problems, I think, only convergence issues. > > Hth, > Andy > > > On 1/8/20 3:31 PM, Beno?t Presles wrote: >> With lbfgs n_iter_ = 48, with saga n_iter_ = 326581, with liblinear >> n_iter_ = 64. >> >> >> On 08/01/2020 21:18, Guillaume Lema?tre wrote: >>> We issue convergence warning. Can you check n_iter to be sure that >>> you did not convergence to the stated convergence? >>> >>> On Wed, 8 Jan 2020 at 20:53, Beno?t Presles >>> >> > wrote: >>> >>> Dear sklearn users, >>> >>> I still have some issues concerning logistic regression. >>> I did compare on the same data (simulated data) sklearn with >>> three different solvers (lbfgs, saga, liblinear) and statsmodels. >>> >>> When everything goes well, I get the same results between lbfgs, >>> saga, liblinear and statsmodels. When everything goes wrong, all >>> the results are different. >>> >>> In fact, when everything goes wrong, statsmodels gives me a >>> convergence warning (Warning: Maximum number of iterations has >>> been exceeded. Current function value: inf Iterations: 20000) + >>> an error (numpy.linalg.LinAlgError: Singular matrix). >>> >>> Why sklearn does not tell me anything? How can I know that I >>> have convergence issues with sklearn? >>> >>> >>> Thanks for your help, >>> Best regards, >>> Ben >>> >>> -------------------------------------------- >>> >>> Here is the code I used to generate synthetic data: >>> >>> from sklearn.datasets import make_classification >>> from sklearn.model_selection import StratifiedShuffleSplit >>> from sklearn.preprocessing import StandardScaler >>> from sklearn.linear_model import LogisticRegression >>> import statsmodels.api as sm >>> # >>> RANDOM_SEED = 2 >>> # >>> X_sim, y_sim = make_classification(n_samples=200, >>> ?????????????????????????? n_features=20, >>> ?????????????????????????? n_informative=10, >>> ?????????????????????????? n_redundant=0, >>> ?????????????????????????? n_repeated=0, >>> ?????????????????????????? n_classes=2, >>> ?????????????????????????? n_clusters_per_class=1, >>> ?????????????????????????? random_state=RANDOM_SEED, >>> ?????????????????????????? shuffle=False) >>> # >>> sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, >>> random_state=RANDOM_SEED) >>> for train_index_split, test_index_split in sss.split(X_sim, y_sim): >>> ??? X_split_train, X_split_test = X_sim[train_index_split], >>> X_sim[test_index_split] >>> ??? y_split_train, y_split_test = y_sim[train_index_split], >>> y_sim[test_index_split] >>> ??? ss = StandardScaler() >>> ??? X_split_train = ss.fit_transform(X_split_train) >>> ??? X_split_test = ss.transform(X_split_test) >>> ??? # >>> ??? classifier_lbfgs = LogisticRegression(fit_intercept=True, >>> max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, >>> ??????????????????????????????????? solver='lbfgs', >>> penalty='none', tol=1e-6) >>> ??? classifier_lbfgs.fit(X_split_train, y_split_train) >>> ??? print('classifier lbfgs iter:', classifier_lbfgs.n_iter_) >>> ??? print(classifier_lbfgs.intercept_) >>> ??? print(classifier_lbfgs.coef_) >>> ??? # >>> ??? 
classifier_saga = LogisticRegression(fit_intercept=True, >>> max_iter=20000000, verbose=0, random_state=RANDOM_SEED, C=1e9, >>> ??????????????????????????????????? solver='saga', >>> penalty='none', tol=1e-6) >>> ??? classifier_saga.fit(X_split_train, y_split_train) >>> ??? print('classifier saga iter:', classifier_saga.n_iter_) >>> ??? print(classifier_saga.intercept_) >>> ??? print(classifier_saga.coef_) >>> ??? # >>> ??? classifier_liblinear = >>> LogisticRegression(fit_intercept=True, max_iter=20000000, >>> verbose=0, random_state=RANDOM_SEED, >>> ???????????????????????????????????????? C=1e9, >>> solver='liblinear', penalty='l2', tol=1e-6) >>> ??? classifier_liblinear.fit(X_split_train, y_split_train) >>> ??? print('classifier liblinear iter:', >>> classifier_liblinear.n_iter_) >>> ??? print(classifier_liblinear.intercept_) >>> ??? print(classifier_liblinear.coef_) >>> ??? # statsmodels >>> ??? logit = sm.Logit(y_split_train, >>> sm.tools.add_constant(X_split_train)) >>> ??? logit_res = logit.fit(maxiter=20000) >>> ??? print("Coef statsmodels") >>> ??? print(logit_res.params) >>> >>> >>> >>> On 11/10/2019 15:42, Andreas Mueller wrote: >>>> >>>> >>>> On 10/10/19 1:14 PM, Beno?t Presles wrote: >>>>> >>>>> Thanks for your answers. >>>>> >>>>> On my real data, I do not have so many samples. I have a bit >>>>> more than 200 samples in total and I also would like to get >>>>> some results with unpenalized logisitic regression. >>>>> What do you suggest? Should I switch to the lbfgs solver? >>>> Yes. >>>>> Am I sure that with this solver I will not have any >>>>> convergence issue and always get the good result? Indeed, I >>>>> did not get any convergence warning with saga, so I thought >>>>> everything was fine. I noticed some issues only when I decided >>>>> to test several solvers. Without comparing the results across >>>>> solvers, how to be sure that the optimisation goes well? >>>>> Shouldn't scikit-learn warn the user somehow if it is not the >>>>> case? >>>> We should attempt to warn in the SAGA solver if it doesn't >>>> converge. That it doesn't raise a convergence warning should >>>> probably be considered a bug. >>>> It uses the maximum weight change as a stopping criterion right >>>> now. >>>> We could probably compute the dual objective once in the end to >>>> see if we converged, right? Or is that not possible with SAGA? >>>> If not, we might want to caution that no convergence warning >>>> will be raised. >>>> >>>>> >>>>> At last, I was using saga because I also wanted to do some >>>>> feature selection by using l1 penalty which is not supported >>>>> by lbfgs... >>>> You can use liblinear then. >>>> >>>> >>>>> >>>>> Best regards, >>>>> Ben >>>>> >>>>> >>>>> Le 09/10/2019 ? 23:39, Guillaume Lema?tre a ?crit?: >>>>>> Ups I did not see the answer of Roman. Sorry about that. It >>>>>> is coming back to the same conclusion :) >>>>>> >>>>>> On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre >>>>>> > wrote: >>>>>> >>>>>> Uhm actually increasing to 10000 samples solve the >>>>>> convergence issue. >>>>>> SAGA is not designed to work with a so small sample size >>>>>> most probably. 
>>>>>> >>>>>> On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre >>>>>> > >>>>>> wrote: >>>>>> >>>>>> I slightly change the bench such that it uses >>>>>> pipeline and plotted the coefficient: >>>>>> >>>>>> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >>>>>> >>>>>> I only see one of the 10 splits where SAGA is not >>>>>> converging, otherwise the coefficients >>>>>> look very close (I don't attach the figure here but >>>>>> they can be plotted using the snippet). >>>>>> So apart from this second split, the other >>>>>> differences seems to be numerical instability. >>>>>> >>>>>> Where I have some concern is regarding the >>>>>> convergence rate of SAGA but I have no >>>>>> intuition to know if this is normal or not. >>>>>> >>>>>> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak >>>>>> >>>>> > wrote: >>>>>> >>>>>> Ben, >>>>>> >>>>>> I can confirm your results with penalty='none' >>>>>> and C=1e9. In both cases, >>>>>> you are running a mostly unpenalized logisitic >>>>>> regression. Usually >>>>>> that's less numerically stable than with a small >>>>>> regularization, >>>>>> depending on the data collinearity. >>>>>> >>>>>> Running that same code with >>>>>> ? - larger penalty ( smaller C values) >>>>>> ? - or larger number of samples >>>>>> ? yields for me the same coefficients (up to some >>>>>> tolerance). >>>>>> >>>>>> You can also see that SAGA convergence is not >>>>>> good by the fact that it >>>>>> needs 196000 epochs/iterations to converge. >>>>>> >>>>>> Actually, I have often seen convergence issues >>>>>> with SAG on small >>>>>> datasets (in unit tests), not fully sure why. >>>>>> >>>>>> -- >>>>>> Roman >>>>>> >>>>>> On 09/10/2019 22:10, serafim loukas wrote: >>>>>> > The predictions across solver are exactly the >>>>>> same when I run the code. >>>>>> > I am using 0.21.3 version. What is yours? 
>>>>>> > >>>>>> > >>>>>> > In [13]: import sklearn >>>>>> > >>>>>> > In [14]: sklearn.__version__ >>>>>> > Out[14]: '0.21.3' >>>>>> > >>>>>> > >>>>>> > Serafeim >>>>>> > >>>>>> > >>>>>> > >>>>>> >> On 9 Oct 2019, at 21:44, Beno?t Presles >>>>>> >>>>> >>>>>> >> >>>>> >> wrote: >>>>>> >> >>>>>> >> (y_pred_lbfgs==y_pred_saga).all() == False >>>>>> > >>>>>> > >>>>>> > _______________________________________________ >>>>>> > scikit-learn mailing list >>>>>> > scikit-learn at python.org >>>>>> >>>>>> > >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> > >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Guillaume Lemaitre >>>>>> Scikit-learn @ Inria Foundation >>>>>> https://glemaitre.github.io/ >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Guillaume Lemaitre >>>>>> Scikit-learn @ Inria Foundation >>>>>> https://glemaitre.github.io/ >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Guillaume Lemaitre >>>>>> Scikit-learn @ Inria Foundation >>>>>> https://glemaitre.github.io/ >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> Scikit-learn @ Inria Foundation >>> https://glemaitre.github.io/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Mon Jan 13 06:11:46 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Mon, 13 Jan 2020 16:41:46 +0530 Subject: [scikit-learn] Changes in the classifier In-Reply-To: References: Message-ID: I just want to change the log function while calculating entropy, how can I do this in the scikit library? On Thu, Jan 9, 2020 at 2:41 PM Adrin wrote: > Outside GIL, you can't really work easily with Python objects. You should > instead stick to local C variables and routines. For instance, instead of > numpy routines, you can use cmath routines. > > The cython book ( > https://www.amazon.com/Cython-Programmers-Kurt-W-Smith/dp/1491901551) > and Nicolas's post (http://nicolas-hug.com/blog/cython_notes) may give > you some hints. 
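Along the lines of the advice above, a minimal Cython fragment (illustrative, not the exact scikit-learn source): replace the np.log2 call with C-level math from libc.math, which can be called without the GIL.

from libc.math cimport log

cdef inline double log2(double x) nogil:
    # base-2 logarithm built from libc's log; callable inside nogil blocks
    return log(x) / log(2.0)

# then, inside the nogil entropy loop, instead of np.log2:
#     entropy -= count_k * log2(count_k)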
> > On Thu, Jan 9, 2020 at 7:24 AM aditya aggarwal < > adityaselfefficient at gmail.com> wrote: > >> Hello >> >> I'm trying to change the entropy function which is used in sklearn for >> DecisionTreeClassification locally on my system. >> when I rerun the pip install --editable . command after updating the >> cython file, I receive the following error message: >> >> Error compiling Cython file: >> ------------------------------------------------------------ >> ... >> for k in range(self.n_outputs): >> for c in range(n_classes[k]): >> count_k = sum_total[c] >> if count_k > 0.0: >> count_k /= self.weighted_n_node_samples >> entropy -= count_k * np.log2(count_k) >> ^ >> ------------------------------------------------------------ >> >> sklearn/tree/_criterion.pyx:537:20: Coercion from Python not allowed >> without the GIL >> >> This error is persisten with other errors as: >> >> Operation not allowed without gil >> Converting to Python object not allowed without gil >> Converting to Python object not allowed without gil >> Calling gil-requiring function not allowed without gil >> Accessing Python attribute not allowed without gil >> Accessing Python global or builtin not allowed without gil >> >> >> I've tried looking up for solution on various sites, but could not >> resolve the issue. >> Any help would be appreciated. >> >> Thanks and regards >> Aditya Aggarwal >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Tue Jan 14 06:00:06 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Tue, 14 Jan 2020 16:30:06 +0530 Subject: [scikit-learn] Decision tree call chronology Message-ID: Hello I am trying to understand the order of functions call for performing classification using decision tree in sklearn. I need to make and test some changes in the algorithm used to calculate best split for my dissertation. I have looked up the documentation available of sklearn and other sources available online but couldn't seem to crack it. Any help would be appreciated. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Tue Jan 14 06:35:24 2020 From: niourf at gmail.com (Nicolas Hug) Date: Tue, 14 Jan 2020 06:35:24 -0500 Subject: [scikit-learn] Decision tree call chronology In-Reply-To: References: Message-ID: Hi Aditya, It's hard for us to answer without any specific question. Perhaps this will help: https://scikit-learn.org/stable/developers/contributing.html#reading-the-existing-code-base The tree code is quite complex, because it is very generic and can support many different settings (multioutput, sparse data, etc) as well as many different parameters like max_features, splitter, presort... I would suggest being familiar with the different parameters before diving into the code. Nicolas On 1/14/20 6:00 AM, aditya aggarwal wrote: > Hello > > I am trying to understand the order of functions call for performing > classification using decision tree in sklearn. I need to make and test > some changes in the algorithm used to calculate best split for my > dissertation. 
I have looked up the documentation available of sklearn > and other sources available online but couldn't seem to crack it. Any > help would be appreciated. > > Thanks > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From fad469 at uregina.ca Tue Jan 14 10:28:48 2020 From: fad469 at uregina.ca (Farzana Anowar) Date: Tue, 14 Jan 2020 09:28:48 -0600 Subject: [scikit-learn] Attribute Incremental learning Message-ID: <48df7513ce06b5e21e112c83211c014f@uregina.ca> Hello, This is Farzana. I am trying to understand the attribute incremental learning ( or virtual concept drift) which is every time when a new feature will be available for a real-time dataset (i.e. any online auction dataset) a classifier will add that new feature with the existing features in a dataset and classify the new dataset (with previous features and new features) incrementally. I know that we can convert a static classifier to an incremental classifier in scikit-learn. However, I could not find any library or function for attribute incremental learning or any detail information. It would be great if anyone could give me some insight on this. Thanks! -- Best Regards, Farzana Anowar, PhD Candidate Department of Computer Science University of Regina From niourf at gmail.com Wed Jan 15 06:45:04 2020 From: niourf at gmail.com (Nicolas Hug) Date: Wed, 15 Jan 2020 06:45:04 -0500 Subject: [scikit-learn] Issues for Berlin and Paris Sprints In-Reply-To: References: Message-ID: Hi Chiara, Thanks for taking care of this > have a list of two/three reviewers available to check on a specific issue That might not be tractable in practice because we have a bunch of "bulk" issues involving many PRs, e.g. the issues about updating the random_state docs everywhere. But assigning reviewers to PRs should be feasible, let's try that. > a bit uncomfortable in pinging core-devs randomly FWIW, feel free to ping and assign me to PRs (not just sprint PRs) Nicolas On 1/6/20 5:13 AM, Chiara Marmo wrote: > Dear core-devs, > > First let me wish a Happy New Year to you all! > > There will be two scikit-learn sprints in January to start this 2020 > in a busy way: one in Berlin [1] (Jan 25) and one in Paris [2] (Jan > 28-31). > I feel like we could benefit of some coordination in selecting the > issues for those two events. > Reshama Shaikh and I, we are already in touch. > > I've opened two projects [3][4] to follow-up the issue selection for > the sprints. > > I will check for previous "Sprint" labels in the skl issues and maybe > ask for clarification on some of them... please, be patient. > The goal is to prepare the two sprints in order to make the review > process as efficient as possible: we don't want to waste the reviewer > time and we hope to make the PR experience a learning opportunity on > both sides. > > In particular, I would like to ask a favour to all of you: I don't > know if this is even always possible, but, IMO, it would be really > useful to have a list of two/three reviewers available to check on a > specific issue. I am, personally, a bit uncomfortable in pinging > core-devs randomly, under the impression of crying wolf lacking for > attention... If people in charge are defined in advance this could, I > think, smooth the review process. What do you think? 
> > Please, let us know if you have any suggestion or recommendation to > improve the Sprint organization. > > Thanks for listening, > Best, > Chiara > > [1] https://github.com/WiMLDS/berlin-2020-scikit-sprint > [2] > https://github.com/scikit-learn/scikit-learn/wiki/Paris-scikit-learn-Sprint-of-the-Decade > [3] https://github.com/WiMLDS/berlin-2020-scikit-sprint/projects/1 > [4] > https://github.com/scikit-learn-fondation/ParisSprintJanuary2020/projects/1 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From maxhalford25 at gmail.com Thu Jan 16 09:36:33 2020 From: maxhalford25 at gmail.com (Max Halford) Date: Thu, 16 Jan 2020 15:36:33 +0100 Subject: [scikit-learn] Attribute Incremental learning In-Reply-To: <48df7513ce06b5e21e112c83211c014f@uregina.ca> References: <48df7513ce06b5e21e112c83211c014f@uregina.ca> Message-ID: Hello Farzana, You might want to check out scikit-multiflow and creme (I'm the author). Kind regards. On Tue, 14 Jan 2020 at 16:59, Farzana Anowar wrote: > Hello, > > This is Farzana. I am trying to understand the attribute incremental > learning ( or virtual concept drift) which is every time when a new > feature will be available for a real-time dataset (i.e. any online > auction dataset) a classifier will add that new feature with the > existing features in a dataset and classify the new dataset (with > previous features and new features) incrementally. I know that we can > convert a static classifier to an incremental classifier in > scikit-learn. However, I could not find any library or function for > attribute incremental learning or any detail information. It would be > great if anyone could give me some insight on this. > > Thanks! > -- > Best Regards, > > Farzana Anowar, > PhD Candidate > Department of Computer Science > University of Regina > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Max Halford +336 28 25 13 38 -------------- next part -------------- An HTML attachment was scrubbed... URL: From fad469 at uregina.ca Thu Jan 16 10:00:05 2020 From: fad469 at uregina.ca (Farzana Anowar) Date: Thu, 16 Jan 2020 09:00:05 -0600 Subject: [scikit-learn] Attribute Incremental learning In-Reply-To: References: <48df7513ce06b5e21e112c83211c014f@uregina.ca> Message-ID: <0aa41d66b218f025a6ee904f6d866d72@uregina.ca> On 2020-01-16 08:36, Max Halford wrote: > Hello Farzana, > > You might want to check out scikit-multiflow [1] and creme [2] (I'm > the author). > > Kind regards. > > On Tue, 14 Jan 2020 at 16:59, Farzana Anowar > wrote: > >> Hello, >> >> This is Farzana. I am trying to understand the attribute incremental >> >> learning ( or virtual concept drift) which is every time when a new >> feature will be available for a real-time dataset (i.e. any online >> auction dataset) a classifier will add that new feature with the >> existing features in a dataset and classify the new dataset (with >> previous features and new features) incrementally. I know that we >> can >> convert a static classifier to an incremental classifier in >> scikit-learn. However, I could not find any library or function for >> attribute incremental learning or any detail information. It would >> be >> great if anyone could give me some insight on this. >> >> Thanks! 
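For the sample-incremental part mentioned above (fitting a classifier on data that arrives in batches), scikit-learn's partial_fit API covers it; a minimal sketch, with an arbitrary dataset split into mini-batches purely for illustration, is below. Attribute-incremental learning, where new feature columns appear over time, is not covered by this API, hence the pointers to external libraries.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = SGDClassifier(loss="log", random_state=0)    # logistic loss, fitted incrementally
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    # the full set of classes must be declared so the first batch knows all labels
    clf.partial_fit(X_batch, y_batch, classes=np.unique(y))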
>> -- >> Best Regards, >> >> Farzana Anowar, >> PhD Candidate >> Department of Computer Science >> University of Regina >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > -- > > Max Halford > > +336 28 25 13 38 > > Links: > ------ > [1] https://scikit-multiflow.github.io/ > [2] https://creme-ml.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn Hello Max, Thanks a lot. -- Best Regards, Farzana Anowar, PhD Candidate Department of Computer Science University of Regina From garyfallidis at gmail.com Wed Jan 15 13:12:32 2020 From: garyfallidis at gmail.com (Eleftherios Garyfallidis) Date: Wed, 15 Jan 2020 13:12:32 -0500 Subject: [scikit-learn] ANN: DIPY 1.1.1 - a powerful release Message-ID: We are excited to announce a new release of DIPY: DIPY 1.1.1 is out! In addition: a) A new 5 day workshop available during March 16-20 to learn the theory and applications of the hundreds of methods available in DIPY 1.1.1 Intense! See the exquisite program here . *b) Given the need for a myriad of new DIPY derivative projects, DIPY moved to its own organization in GitHub. **Long live DIPY! * *And therefore, *https://github.com/dipy/dipy* supersedes https://github.com/nipy/dipy The old link will be available as a redirect link for the next 6 months.* c) Please support us by *citing** DIPY* in your papers using the following DOI: 10.3389/fninf.2014.00008 otherwise the DIPY citation police will find you. ;) DIPY 1.1.1 (Friday, 10 January 2020) This release received contributions from 11 developers (the full release notes are at: https://dipy.org/documentation/1.1.1./release_notes/release1.1/). Thank you all for your contributions and feedback! Please click here to check API changes. Highlights of this release include: - New module for deep learning DIPY.NN (uses TensorFlow 2.0). - Improved DKI performance and increased utilities. - Non-linear and RESTORE fits from DTI compatible now with DKI. - Numerical solutions for estimating axial, radial and mean kurtosis. - Added Kurtosis Fractional Anisotropy by Glenn et al. 2015. - Added Mean Kurtosis Tensor by Hansen et al. 2013. - Nibabel minimum version is 3.0.0. - Azure CI added and Appveyor CI removed. - New command line interfaces for LPCA, MPPCA and Gibbs Unringing. - New MTMS CSD tutorial added. - Horizon refactored and updated to support StatefulTractograms. - Speeded up all cython modules by using a smarter configuration setting. - All tutorials updated to API changes and 2 new tutorials added. - Large documentation update. - Closed 126 issues and merged 50 pull requests. Note: - Have in mind that DIPY stopped supporting Python 2 after version 0.16.0. All major Python projects have switched to Python 3. It is time that you switch too. To upgrade or install DIPY Run the following command in your terminal: pip install --upgrade dipy or conda install -c conda-forge dipy This version of DIPY depends on nibabel (3.0.0+). For visualization you need FURY (0.4.0+). Questions or suggestions? 
For any questions go to http://dipy.org, or send an e-mail to dipy at python.org We also have an instant messaging service and chat room available at https://gitter.im/nipy/dipy On behalf of the DIPY developers, Eleftherios Garyfallidis, Ariel Rokem, Serge Koudoro https://dipy.org/contributors -------------- next part -------------- An HTML attachment was scrubbed... URL: From marmochiaskl at gmail.com Fri Jan 17 03:45:16 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Fri, 17 Jan 2020 09:45:16 +0100 Subject: [scikit-learn] Issues for Berlin and Paris Sprints In-Reply-To: References: Message-ID: Hi Nicolas, thanks for your answer. have a list of two/three reviewers available to check on a specific issue > > That might not be tractable in practice because we have a bunch of "bulk" > issues involving many PRs, e.g. the issues about updating the random_state > docs everywhere. But assigning reviewers to PRs should be feasible, let's > try that. > Is that why now suggested reviewers are added to PR? I've googled and found this https://github.community/t5/How-to-use-Git-and-GitHub/Use-codeowners-to-suggest-reviewers-NOT-automatically-assign/td-p/11503 Is that the criterion used to populate suggested reviewers? Just trying to follow... :) Thanks, Chiara -------------- next part -------------- An HTML attachment was scrubbed... URL: From dstromberg at grokstream.com Fri Jan 17 15:38:50 2020 From: dstromberg at grokstream.com (Dan Stromberg) Date: Fri, 17 Jan 2020 12:38:50 -0800 Subject: [scikit-learn] Heisenbug? In-Reply-To: References: Message-ID: It's looking, at this point, like: 1) The NaN's are real 2) They're coming from some XGBoost native code, or perhaps a Python<->native boundary, which is interfacing using ctypes. The print's that didn't print were probably because of a misplaced flush. The debugger that didn't debug was probably because of pytest capturing stdout and async python code. Thanks. On Wed, Dec 18, 2019 at 4:09 PM Dan Stromberg wrote: > > Any (further) suggestions folks? > > BTW, when I say pudb fails to start, I mean it's tracebacking trying to > get None.fileno() In other pieces of (C)Python code I've tried it in, > pudb.set_trace() worked nicely. > > On Tue, Dec 17, 2019 at 7:50 AM Dan Stromberg > wrote: > >> >> Hi. >> >> Overflow does sound kind of possible. We're sending semi-random values >> to the test. >> >> I believe our systems are all x86_64, Linux. Some are Ubuntu 16.04, some >> are Mint 19.2. >> >> I realized on the way to work this morning, that I left out some >> important information; I suspect a heisenbug for 3 reasons: >> >> 1) If I try to look at it with print functions, I get a traceback after >> the print's, but no print output. This happens with both writing to a >> disk-based file, and with printing to stdout. >> >> 2) If I try to look at it with pudb (a debugger) via pudb.set_trace(), I >> get a failure to start pudb. >> >> 3) If I create a small test program that sends the same inputs to the >> function in question, the function works fine. >> >> Thanks. >> >> On Mon, Dec 16, 2019 at 11:20 PM Joel Nothman >> wrote: >> >>> Hi Dan, this kind of error can come from overflow. Are all of your test >>> systems the same architecture? >>> >>> On Tue., 17 Dec. 2019, 12:03 pm Dan Stromberg, < >>> dstromberg at grokstream.com> wrote: >>> >>>> Hi folks. >>>> >>>> I'm new to Scikit-learn. >>>> >>>> I have a very large Python project that seems to have a heisenbug which >>>> is manifesting in scikit-learn code. 
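For intermittent "Input contains NaN, infinity or a value too large for dtype" failures like the one described here, one cheap diagnostic is to validate the array just before it is handed to fit, so the report points at the producer of the bad values rather than at code deep inside the estimator. A minimal sketch, assuming a 2-D numeric feature matrix; the helper name and where it is called are illustrative only.

import numpy as np

def report_non_finite(name, X):
    # Illustrative helper: locate NaN/inf values before an estimator rejects the array.
    X = np.asarray(X, dtype=np.float64)
    bad = ~np.isfinite(X)
    if bad.any():
        rows, cols = np.nonzero(bad)
        raise ValueError("%s: %d non-finite values, first at row %d, column %d"
                         % (name, bad.sum(), rows[0], cols[0]))

# e.g. report_non_finite("X_train", X_train) right before clf_random.fit(X_train, y_train)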
>>>> >>>> Short of constructing an SSCCE, are there any magical techniques I >>>> should try for pinning down the precise cause? Like valgrind or something? >>>> >>>> An SSCCE will most likely be pretty painful: the project has copious >>>> shared, mutable state, and I've already tried a largish test program that >>>> calls into the same code path with the error manifesting 0 times in 100. >>>> >>>> It's quite possible the root cause will turn out to be some other part >>>> of the software stack. >>>> >>>> The traceback from pytest looks like: >>>> sequential/test_training.py:101: >>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ >>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ >>>> _ _ _ _ _ _ _ _ _ _ _ _ _ >>>> ../rt/classifier/coach.py:146: in train >>>> **self.classifier_section >>>> ../domain/classifier/factories/classifier_academy.py:115: in >>>> create_classifier >>>> **kwargs) >>>> ../domain/classifier/factories/imp/xgb_factory.py:164: in create >>>> clf_random.fit(X_train, y_train) >>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:722: >>>> in fit >>>> self._run_search(evaluate_candidates) >>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:1515: >>>> in _run_search >>>> random_state=self.random_state)) >>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:711: >>>> in evaluate_candidates >>>> cv.split(X, y, groups))) >>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py:996: >>>> in __call__ >>>> self.retrieve() >>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py:899: >>>> in retrieve >>>> self._output.extend(job.get(timeout=self.timeout)) >>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py:517: >>>> in wrap_future_result >>>> return future.result(timeout=timeout) >>>> /usr/lib/python3.6/concurrent/futures/_base.py:425: in result >>>> return self.__get_result() >>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ >>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ >>>> _ _ _ _ _ _ _ _ _ _ _ _ _ >>>> >>>> self = >>>> >>>> def __get_result(self): >>>> if self._exception: >>>> > raise self._exception >>>> E ValueError: Input contains NaN, infinity or a value too >>>> large for dtype('float32'). >>>> >>>> /usr/lib/python3.6/concurrent/futures/_base.py:384: ValueError >>>> >>>> >>>> The above exception is raised about 12 to 14 times in 100 in full-blown >>>> automated testing. >>>> >>>> Thanks for the cool software. >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jan 18 08:09:46 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 19 Jan 2020 00:09:46 +1100 Subject: [scikit-learn] ANN: DIPY 1.1.1 - a powerful release In-Reply-To: References: Message-ID: If the Scikit-learn mailing list is going to include announcements of related package releases, could we please get a line or two describing that package? 
I expect most readers here don't know of DIPY, or of its relevance to Scikit-learn users. (I'm still not sure why it's generally relevant to scikit-learn users.) Thanks On Fri, 17 Jan 2020 at 04:04, Eleftherios Garyfallidis < garyfallidis at gmail.com> wrote: > We are excited to announce a new release of DIPY: > > > DIPY 1.1.1 is out! In addition: > > > a) A new 5 day workshop available during March 16-20 to learn the theory > and applications of the hundreds of methods available in DIPY 1.1.1 > Intense! > > See the exquisite program here . > > *b) Given the need for a myriad of new DIPY derivative projects, DIPY > moved to its own organization in GitHub. **Long live DIPY! * > *And therefore, *https://github.com/dipy/dipy* supersedes https://github.com/nipy/dipy > The old link will be available as a redirect > link for the next 6 months.* > > c) Please support us by *citing** DIPY* in your papers using the > following DOI: 10.3389/fninf.2014.00008 > otherwise the DIPY citation > police will find you. ;) > > DIPY 1.1.1 (Friday, 10 January 2020) > > This release received contributions from 11 developers (the full release > notes are at: > https://dipy.org/documentation/1.1.1./release_notes/release1.1/). Thank > you all for your contributions and feedback! > > Please click here to > check API changes. > > Highlights of this release include: > > - > > New module for deep learning DIPY.NN (uses TensorFlow 2.0). > - > > Improved DKI performance and increased utilities. > - > > Non-linear and RESTORE fits from DTI compatible now with DKI. > - > > Numerical solutions for estimating axial, radial and mean kurtosis. > - > > Added Kurtosis Fractional Anisotropy by Glenn et al. 2015. > - > > Added Mean Kurtosis Tensor by Hansen et al. 2013. > - > > Nibabel minimum version is 3.0.0. > - > > Azure CI added and Appveyor CI removed. > - > > New command line interfaces for LPCA, MPPCA and Gibbs Unringing. > - > > New MTMS CSD tutorial added. > - > > Horizon refactored and updated to support StatefulTractograms. > - > > Speeded up all cython modules by using a smarter configuration setting. > - > > All tutorials updated to API changes and 2 new tutorials added. > - > > Large documentation update. > - > > Closed 126 issues and merged 50 pull requests. > > Note: > > - > > Have in mind that DIPY stopped supporting Python 2 after version > 0.16.0. All major Python projects have switched to Python 3. It is time > that you switch too. > > > > To upgrade or install DIPY > > Run the following command in your terminal: > > > pip install --upgrade dipy > > or > > conda install -c conda-forge dipy > > This version of DIPY depends on nibabel (3.0.0+). > > For visualization you need FURY (0.4.0+). > > Questions or suggestions? > > > > For any questions go to http://dipy.org, or send an e-mail to > dipy at python.org > > We also have an instant messaging service and chat room available at > https://gitter.im/nipy/dipy > > On behalf of the DIPY developers, > > Eleftherios Garyfallidis, Ariel Rokem, Serge Koudoro > > https://dipy.org/contributors > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charujing123 at 163.com Mon Jan 20 06:15:50 2020 From: charujing123 at 163.com (Rujing Zha) Date: Mon, 20 Jan 2020 19:15:50 +0800 (CST) Subject: [scikit-learn] ask a question about weights for features in svc with rbf kernel Message-ID: <13ac58a7.68b5.16fc2aa7ac3.Coremail.charujing123@163.com> Hi experts and users, I am going to extact the pattern of svc. But I do not know how to extract weights for each feature using this svc classifiers with kernel of rbf function. Thank you. Rujing -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Jan 20 07:30:53 2020 From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=) Date: Mon, 20 Jan 2020 13:30:53 +0100 Subject: [scikit-learn] ask a question about weights for features in svc with rbf kernel In-Reply-To: <13ac58a7.68b5.16fc2aa7ac3.Coremail.charujing123@163.com> Message-ID: An HTML attachment was scrubbed... URL: From charujing123 at 163.com Mon Jan 20 09:52:26 2020 From: charujing123 at 163.com (Rujing Zha) Date: Mon, 20 Jan 2020 22:52:26 +0800 (CST) Subject: [scikit-learn] ask a question about weights for features in svc with rbf kernel In-Reply-To: References: Message-ID: <45b1d5ed.85cc.16fc370cbe4.Coremail.charujing123@163.com> Hi Guillaume Is it OK for rbf kernel? As the document said: Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel. At 2020-01-20 20:30:53, "Guillaume Lema?tre" wrote: You can look at the attribute coef_ once your model is fitted. Sent from my phone - sorry to be brief and potential misspell. | From: charujing123 at 163.com Sent: 20 January 2020 12:32 To: scikit-learn at python.org Reply to: scikit-learn at python.org Subject: [scikit-learn] ask a question about weights for features in svc with rbf kernel | Hi experts and users, I am going to extact the pattern of svc. But I do not know how to extract weights for each feature using this svc classifiers with kernel of rbf function. Thank you. Rujing -------------- next part -------------- An HTML attachment was scrubbed... URL: From pehlivaniancharles at gmail.com Mon Jan 20 19:49:12 2020 From: pehlivaniancharles at gmail.com (Charles Pehlivanian) Date: Mon, 20 Jan 2020 19:49:12 -0500 Subject: [scikit-learn] Why is subset invariance necessary for transfom()? Message-ID: Not all data transformers have a transform method. For those that do, subset invariance is assumed as expressed in check_methods_subset_invariance(). It must be the case that T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic projections - PCA, kernel PCA, etc., but not for some manifold learning transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement of the data in space is a constrained optimization, may take into account the centroid of the dataset etc. The manifold learners have "batch" oos transform() methods that aren't implemented, and wouldn't pass that test. Instead, those that do - LocallyLinearEmbedding - use a pointwise version, essentially replacing a batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]: for i in range(X.shape[0]): X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i]) Where to implement the batch transform() methods for MDS, SpectralEmbedding, LocallyLinearEmbedding, etc? Another verb? Both batch and pointwise versions? The latter is easy to implement once the batch version exists. Relax the test conditions? 
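For reference, the invariance in question can be checked directly; a minimal sketch with PCA, one of the transformers for which it does hold (the dataset and slice size are arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# Subset invariance: transforming a slice equals slicing the transformed batch.
print(np.allclose(pca.transform(X)[:10], pca.transform(X[:10])))   # True for PCA

A batch-optimal out-of-sample embedding for MDS or SpectralEmbedding, which re-solves the placement problem using all test points at once, would in general not satisfy this equality, which is exactly the tension described in this thread.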
transform() is necessary for oos testing, so necessary for cross validation. The batch versions should be preferred, although as it stands, the pointwise versions are. Thanks Charles Pehlivanian -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jan 20 21:24:52 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 21 Jan 2020 13:24:52 +1100 Subject: [scikit-learn] Why is subset invariance necessary for transfom()? In-Reply-To: References: Message-ID: I think allowing subset invariance to not hold is making stronger assumptions than we usually do about what it means to have a "test set". Having a transformation like this that relies on test set statistics implies that the test set is more than just selected samples, but rather that a large collection of samples is available at one time, and that it is in some sense sufficient or complete (no more samples are available that would give a better fit). So in a predictive modelling context you might have to set up your cross validation splits with this in mind. In terms of API, the subset invariance constraint allows us to assume that the transformation can be distributed or parallelized over samples. I'm not sure whether we have exploited that assumption within scikit-learn or whether related projects do so. I see the benefit of using such transformations in a prediction Pipeline, and really appreciate this challenge to our assumptions of what "transform" means. Joel On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian, < pehlivaniancharles at gmail.com> wrote: > Not all data transformers have a transform method. For those that do, > subset invariance is assumed as expressed > in check_methods_subset_invariance(). It must be the case that > T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic > projections - PCA, kernel PCA, etc., but not for some manifold learning > transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement > of the data in space is a constrained optimization, may take into account > the centroid of the dataset etc. > > The manifold learners have "batch" oos transform() methods that aren't > implemented, and wouldn't pass that test. Instead, those that do - > LocallyLinearEmbedding - use a pointwise version, essentially replacing a > batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]: > > for i in range(X.shape[0]): > X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i]) > > Where to implement the batch transform() methods for MDS, > SpectralEmbedding, LocallyLinearEmbedding, etc? > > Another verb? Both batch and pointwise versions? The latter is easy to > implement once the batch version exists. Relax the test conditions? > transform() is necessary for oos testing, so necessary for cross > validation. The batch versions should be preferred, although as it stands, > the pointwise versions are. > > Thanks > Charles Pehlivanian > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pehlivaniancharles at gmail.com Tue Jan 21 20:23:17 2020 From: pehlivaniancharles at gmail.com (Charles Pehlivanian) Date: Tue, 21 Jan 2020 20:23:17 -0500 Subject: [scikit-learn] Why is subset invariance necessary for transfom()? 
Message-ID: I understand - I'm kind of conflating the idea of data sample with test set, my view assumes there are a sample space of samples, might require rethinking the cross-validation setup... I also think that part of it relies on the notion of online vs. offline algorithm. For offline fits, a batch transform (non-subset invariant) is preferred. For a transformer that can only be used in an online sense, or is primarily used that way, keep the invariant. I see 3 options here - all I can say is that I don't vote for the first + No transform method on the manifold learners, so no cross-validation + Pointwise, distributable, subset-invariant, suboptimal greedy transform + Non-distributable, non-subset-invariant, optimal batch transform -Charles On Mon., Jan. 20, 21:24:52 2020 > wrote I think allowing subset invariance to not hold is making stronger assumptions than we usually do about what it means to have a "test set". Having a transformation like this that relies on test set statistics implies that the test set is more than just selected samples, but rather that a large collection of samples is available at one time, and that it is in some sense sufficient or complete (no more samples are available that would give a better fit). So in a predictive modelling context you might have to set up your cross validation splits with this in mind. In terms of API, the subset invariance constraint allows us to assume that the transformation can be distributed or parallelized over samples. I'm not sure whether we have exploited that assumption within scikit-learn or whether related projects do so. I see the benefit of using such transformations in a prediction Pipeline, and really appreciate this challenge to our assumptions of what "transform" means. Joel On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian, > wrote: >* Not all data transformers have a transform method. For those that do, *>* subset invariance is assumed as expressed *>* in check_methods_subset_invariance(). It must be the case that *>* T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic *>* projections - PCA, kernel PCA, etc., but not for some manifold learning *>* transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement *>* of the data in space is a constrained optimization, may take into account *>* the centroid of the dataset etc. *>>* The manifold learners have "batch" oos transform() methods that aren't *>* implemented, and wouldn't pass that test. Instead, those that do - *>* LocallyLinearEmbedding - use a pointwise version, essentially replacing a *>* batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]: *>>* for i in range(X.shape[0]): *>* X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i]) *>>* Where to implement the batch transform() methods for MDS, *>* SpectralEmbedding, LocallyLinearEmbedding, etc? *>>* Another verb? Both batch and pointwise versions? The latter is easy to *>* implement once the batch version exists. Relax the test conditions? *>* transform() is necessary for oos testing, so necessary for cross *>* validation. The batch versions should be preferred, although as it stands, *>* the pointwise versions are. *>>* Thanks *>* Charles Pehlivanian *>* _______________________________________________ *>* scikit-learn mailing list *>* scikit-learn at python.org *>* https://mail.python.org/mailman/listinfo/scikit-learn *>-------------- next part -------------- An HTML attachment was scrubbed... 
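On the earlier question about feature weights for an rbf-kernel SVC: a kernel SVM has no per-feature coefficients in input space (coef_ exists only for linear kernels, as the documentation quoted above says), but a model-agnostic measure such as permutation importance can be computed for any fitted classifier. A minimal sketch, assuming scikit-learn >= 0.22 and an arbitrary example dataset; ideally the importances would be computed on held-out data rather than on the training set.

from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)

# Importance of each feature = average drop in score when that column is shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)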
URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jan 21 20:28:14 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 21 Jan 2020 20:28:14 -0500 Subject: [scikit-learn] ask a question about weights for features in svc with rbf kernel In-Reply-To: <45b1d5ed.85cc.16fc370cbe4.Coremail.charujing123@163.com> References: <45b1d5ed.85cc.16fc370cbe4.Coremail.charujing123@163.com> Message-ID: <8f54a47e-54da-eeff-fff9-ee43fab40691@gmail.com> There is no coef_ for kernel SVMs. What exactly are you looking for? On 1/20/20 9:52 AM, Rujing Zha wrote: > Hi Guillaume > Is it OK for rbf kernel? As the document said:? ?Weights assigned to > the features (coefficients in the primal problem). This is only > available in the case of a*/linear kernel./* > > > > > At 2020-01-20 20:30:53, "Guillaume Lema?tre" > wrote: > > You can look at the attribute coef_ once your model is fitted. > > Sent from my phone - sorry to be brief and potential misspell. > > *From:* charujing123 at 163.com > *Sent:* 20 January 2020 12:32 > *To:* scikit-learn at python.org > *Reply to:* scikit-learn at python.org > *Subject:* [scikit-learn] ask a question about weights for > features in svc with rbf kernel > > > Hi experts and users, > I am going to extact the pattern of svc. But I do not know how to > extract weights for each feature using this svc classifiers with > kernel of rbf function. > Thank you. > Rujing > > > > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jan 21 20:33:28 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 21 Jan 2020 20:33:28 -0500 Subject: [scikit-learn] Why is subset invariance necessary for transfom()? In-Reply-To: References: Message-ID: <9f6eaca1-dc90-6683-f6dc-cb312509e41d@gmail.com> On 1/21/20 8:23 PM, Charles Pehlivanian wrote: > I understand - I'm kind of conflating the idea of data sample with test set, my view assumes there are a sample space of samples, might require rethinking the cross-validation setup... > I also think that part of it relies on the notion of online vs. offline algorithm. For offline fits, a batch transform (non-subset invariant) is preferred. For a transformer that can only be used in an online sense, or is primarily used that way, keep the invariant. > I see 3 options here - all I can say is that I don't vote for the first > + No transform method on the manifold learners, so no cross-validation This is what I thought we usually do. It looks like you said we are doing a greedy transform. I'm not sure I follow that. In particular for spectral embedding for example there is a pretty way to describe the transform and that's what we're doing. You could also look at doing transductive learning but that's not really the standard formulation, is it? > + Pointwise, distributable, subset-invariant, suboptimal greedy transform > + Non-distributable, non-subset-invariant, optimal batch transform Can you give an example of that? > -Charles > On Mon., Jan. 20, 21:24:52 2020 > wrote > I think allowing subset invariance to not hold is making stronger > assumptions than we usually do about what it means to have a "test set". 
> Having a transformation like this that relies on test set statistics > implies that the test set is more than just selected samples, but rather > that a large collection of samples is available at one time, and that it is > in some sense sufficient or complete (no more samples are available that > would give a better fit). So in a predictive modelling context you might > have to set up your cross validation splits with this in mind. > > In terms of API, the subset invariance constraint allows us to assume that > the transformation can be distributed or parallelized over samples. I'm not > sure whether we have exploited that assumption within scikit-learn or > whether related projects do so. > > I see the benefit of using such transformations in a prediction Pipeline, > and really appreciate this challenge to our assumptions of what "transform" > means. > > Joel > > On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian, < > pehlivaniancharles at gmail.com > wrote: > > >/Not all data transformers have a transform method. For those that do, />/subset invariance is assumed as expressed />/in check_methods_subset_invariance(). It must be the case that />/T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for > classic />/projections - PCA, kernel PCA, etc., but not for some manifold learning />/transformers - MDS, SpectralEmbedding, etc. For those, an optimal > placement />/of the data in space is a constrained optimization, may take into > account />/the centroid of the dataset etc. />//>/The manifold learners have "batch" oos transform() methods that aren't />/implemented, and wouldn't pass that test. Instead, those that do - />/LocallyLinearEmbedding - use a pointwise version, essentially > replacing a />/batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]: />//>/for i in range(X.shape[0]): />/X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i]) />//>/Where to implement the batch transform() methods for MDS, />/SpectralEmbedding, LocallyLinearEmbedding, etc? />//>/Another verb? Both batch and pointwise versions? The latter is easy to />/implement once the batch version exists. Relax the test conditions? />/transform() is necessary for oos testing, so necessary for cross />/validation. The batch versions should be preferred, although as it > stands, />/the pointwise versions are. />//>/Thanks />/Charles Pehlivanian />/_______________________________________________ />/scikit-learn mailing list />/scikit-learn at python.org > />/https://mail.python.org/mailman/listinfo/scikit-learn />//-------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From pehlivaniancharles at gmail.com Tue Jan 21 21:19:45 2020 From: pehlivaniancharles at gmail.com (Charles Pehlivanian) Date: Tue, 21 Jan 2020 21:19:45 -0500 Subject: [scikit-learn] Why is subset invariance necessary for transfom()? In-Reply-To: <9f6eaca1-dc90-6683-f6dc-cb312509e41d@gmail.com> References: <9f6eaca1-dc90-6683-f6dc-cb312509e41d@gmail.com> Message-ID: This is what I thought we usually do. It looks like you said we are doing a greedy transform. I'm not sure I follow that. In particular for spectral embedding for example there is a pretty way to describe the transform and that's what we're doing. 
You could also look at doing transductive learning but that's not really the standard formulation, is it? Batch transform becomes greedy if one does: for x_i in X: X_new_i = self.transform(x_i) I said that LLE uses greedy algorithm. The algorithm implemented is pointwise. It may be that that's the only approach (in which case it's not greedy), but I don't think so - looks like all of the spectral embedding, lle, mds transforms have batch versions. So I probably shouldn't call it greedy. Taking a *true* batch transform and enclosing it in a loop like that - I'm calling that greedy. I'm honestly not sure if the LLE qualifies. Spectral embedding - agree, the method you refer to is implemented in fit_transform(). How to apply to oos points? Non-distributable, non-subset-invariant, optimal batch transform Can you give an example of that? Most of the manifold learners can be expressed as solutions to eigenvalue/vector problems. For MDS batch transform, form a new constrained double-centered distance matrix and solve a constrained least-squares problem that mimics the SVD solution to the eigenvalue problem. They're all like this - least-squares estimates for some constrained eigenvalue problem. The question is whether you want to solve the full problem, or solve on each point, adding one row and optimzing each time, ... that would be subset-invariant though. For this offline/batch approach to an oos transform, the only way I see to make it pass tests is to enclose it in a loop as above. That's what I see at least. On Tue, Jan 21, 2020 at 8:35 PM Andreas Mueller wrote: > > > On 1/21/20 8:23 PM, Charles Pehlivanian wrote: > > I understand - I'm kind of conflating the idea of data sample with test set, my view assumes there are a sample space of samples, might require rethinking the cross-validation setup... > > I also think that part of it relies on the notion of online vs. offline algorithm. For offline fits, a batch transform (non-subset invariant) is preferred. For a transformer that can only be used in an online sense, or is primarily used that way, keep the invariant. > > > I see 3 options here - all I can say is that I don't vote for the first > > + No transform method on the manifold learners, so no cross-validation > > This is what I thought we usually do. It looks like you said we are doing > a greedy transform. > I'm not sure I follow that. In particular for spectral embedding for > example there is a pretty way to describe > the transform and that's what we're doing. You could also look at doing > transductive learning but that's > not really the standard formulation, is it? > > + Pointwise, distributable, subset-invariant, suboptimal greedy transform > > + Non-distributable, non-subset-invariant, optimal batch transform > > Can you give an example of that? > > -Charles > > On Mon., Jan. 20, 21:24:52 2020 > wrote > > I think allowing subset invariance to not hold is making stronger > > assumptions than we usually do about what it means to have a "test set". > Having a transformation like this that relies on test set statistics > implies that the test set is more than just selected samples, but rather > that a large collection of samples is available at one time, and that it is > in some sense sufficient or complete (no more samples are available that > would give a better fit). So in a predictive modelling context you might > have to set up your cross validation splits with this in mind. 
> > In terms of API, the subset invariance constraint allows us to assume that > the transformation can be distributed or parallelized over samples. I'm not > sure whether we have exploited that assumption within scikit-learn or > whether related projects do so. > > I see the benefit of using such transformations in a prediction Pipeline, > and really appreciate this challenge to our assumptions of what "transform" > means. > > Joel > > On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian, > wrote: > > >* Not all data transformers have a transform method. For those that do, > *>* subset invariance is assumed as expressed > *>* in check_methods_subset_invariance(). It must be the case that > *>* T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic > *>* projections - PCA, kernel PCA, etc., but not for some manifold learning > *>* transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement > *>* of the data in space is a constrained optimization, may take into account > *>* the centroid of the dataset etc. > *>>* The manifold learners have "batch" oos transform() methods that aren't > *>* implemented, and wouldn't pass that test. Instead, those that do - > *>* LocallyLinearEmbedding - use a pointwise version, essentially replacing a > *>* batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]: > *>>* for i in range(X.shape[0]): > *>* X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i]) > *>>* Where to implement the batch transform() methods for MDS, > *>* SpectralEmbedding, LocallyLinearEmbedding, etc? > *>>* Another verb? Both batch and pointwise versions? The latter is easy to > *>* implement once the batch version exists. Relax the test conditions? > *>* transform() is necessary for oos testing, so necessary for cross > *>* validation. The batch versions should be preferred, although as it stands, > *>* the pointwise versions are. > *>>* Thanks > *>* Charles Pehlivanian > *>* _______________________________________________ > *>* scikit-learn mailing list > *>* scikit-learn at python.org > *>* https://mail.python.org/mailman/listinfo/scikit-learn > *>-------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Wed Jan 22 07:17:00 2020 From: ahowe42 at gmail.com (Andrew Howe) Date: Wed, 22 Jan 2020 12:17:00 +0000 Subject: [scikit-learn] ANN: DIPY 1.1.1 - a powerful release In-Reply-To: References: Message-ID: I was unaware of this package, and had to look it up. It's my opinion that his package is only relevant to a likely small subset of users engaged in computational neuroanatomy. I am not sure updates really belong on this list... Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Sat, Jan 18, 2020 at 1:12 PM Joel Nothman wrote: > If the Scikit-learn mailing list is going to include announcements of > related package releases, could we please get a line or two describing that > package? 
I expect most readers here don't know of DIPY, or of its relevance > to Scikit-learn users. (I'm still not sure why it's generally relevant to > scikit-learn users.) > > Thanks > > On Fri, 17 Jan 2020 at 04:04, Eleftherios Garyfallidis < > garyfallidis at gmail.com> wrote: > >> We are excited to announce a new release of DIPY: >> >> >> DIPY 1.1.1 is out! In addition: >> >> >> a) A new 5 day workshop available during March 16-20 to learn the theory >> and applications of the hundreds of methods available in DIPY 1.1.1 >> Intense! >> >> See the exquisite program here . >> >> *b) Given the need for a myriad of new DIPY derivative projects, DIPY >> moved to its own organization in GitHub. **Long live DIPY! * >> *And therefore, *https://github.com/dipy/dipy* supersedes https://github.com/nipy/dipy >> The old link will be available as a redirect >> link for the next 6 months.* >> >> c) Please support us by *citing** DIPY* in your papers using the >> following DOI: 10.3389/fninf.2014.00008 >> otherwise the DIPY >> citation police will find you. ;) >> >> DIPY 1.1.1 (Friday, 10 January 2020) >> >> This release received contributions from 11 developers (the full release >> notes are at: >> https://dipy.org/documentation/1.1.1./release_notes/release1.1/). Thank >> you all for your contributions and feedback! >> >> Please click here >> to check API changes. >> >> Highlights of this release include: >> >> - >> >> New module for deep learning DIPY.NN (uses TensorFlow 2.0). >> - >> >> Improved DKI performance and increased utilities. >> - >> >> Non-linear and RESTORE fits from DTI compatible now with DKI. >> - >> >> Numerical solutions for estimating axial, radial and mean kurtosis. >> - >> >> Added Kurtosis Fractional Anisotropy by Glenn et al. 2015. >> - >> >> Added Mean Kurtosis Tensor by Hansen et al. 2013. >> - >> >> Nibabel minimum version is 3.0.0. >> - >> >> Azure CI added and Appveyor CI removed. >> - >> >> New command line interfaces for LPCA, MPPCA and Gibbs Unringing. >> - >> >> New MTMS CSD tutorial added. >> - >> >> Horizon refactored and updated to support StatefulTractograms. >> - >> >> Speeded up all cython modules by using a smarter configuration >> setting. >> - >> >> All tutorials updated to API changes and 2 new tutorials added. >> - >> >> Large documentation update. >> - >> >> Closed 126 issues and merged 50 pull requests. >> >> Note: >> >> - >> >> Have in mind that DIPY stopped supporting Python 2 after version >> 0.16.0. All major Python projects have switched to Python 3. It is time >> that you switch too. >> >> >> >> To upgrade or install DIPY >> >> Run the following command in your terminal: >> >> >> pip install --upgrade dipy >> >> or >> >> conda install -c conda-forge dipy >> >> This version of DIPY depends on nibabel (3.0.0+). >> >> For visualization you need FURY (0.4.0+). >> >> Questions or suggestions? 
>> >> >> >> For any questions go to http://dipy.org, or send an e-mail to >> dipy at python.org >> >> We also have an instant messaging service and chat room available at >> https://gitter.im/nipy/dipy >> >> On behalf of the DIPY developers, >> >> Eleftherios Garyfallidis, Ariel Rokem, Serge Koudoro >> >> https://dipy.org/contributors >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From garyfallidis at gmail.com Wed Jan 22 10:54:04 2020 From: garyfallidis at gmail.com (Eleftherios Garyfallidis) Date: Wed, 22 Jan 2020 10:54:04 -0500 Subject: [scikit-learn] ANN: DIPY 1.1.1 - a powerful release In-Reply-To: References: Message-ID: Hello Joel, Here is the short description about DIPY as requested. DIPY is the paragon 3D/4D+ imaging library in Python. Contains generic methods for spatial normalization, signal processing, machine learning, statistical analysis and visualization of medical images. Additionally, it contains specialized methods for computational anatomy including diffusion, perfusion and structural imaging. Gael and Alex from sklearn's leadership team are aware of the project and its importance to the Pythonic community. Have in mind that there is an extremely low number of community based open source projects in medical imaging in Python and DIPY is one of the very few examples that is actually stable and growing. DIPY is quite unique because it provides methods (algorithms developed from scratch in Python) for solving medical imaging problems. Some of the algorithms are very generic for example we have image registration and denoising algorithms and can be used across fields. It is true that in our website it shows that focus is on diffusion imaging but now this is changing and we will be updating all our websites and systems accordingly to explain the more generic extent of the library during the following weeks. Your call at the end. It would be nice if you can spread the word. No hard feelings otherwise. Best, Eleftherios On Sat, Jan 18, 2020 at 8:12 AM Joel Nothman wrote: > If the Scikit-learn mailing list is going to include announcements of > related package releases, could we please get a line or two describing that > package? I expect most readers here don't know of DIPY, or of its relevance > to Scikit-learn users. (I'm still not sure why it's generally relevant to > scikit-learn users.) > > Thanks > > On Fri, 17 Jan 2020 at 04:04, Eleftherios Garyfallidis < > garyfallidis at gmail.com> wrote: > >> We are excited to announce a new release of DIPY: >> >> >> DIPY 1.1.1 is out! In addition: >> >> >> a) A new 5 day workshop available during March 16-20 to learn the theory >> and applications of the hundreds of methods available in DIPY 1.1.1 >> Intense! >> >> See the exquisite program here . >> >> *b) Given the need for a myriad of new DIPY derivative projects, DIPY >> moved to its own organization in GitHub. **Long live DIPY! 
* >> *And therefore, *https://github.com/dipy/dipy* supersedes https://github.com/nipy/dipy >> The old link will be available as a redirect >> link for the next 6 months.* >> >> c) Please support us by *citing** DIPY* in your papers using the >> following DOI: 10.3389/fninf.2014.00008 >> otherwise the DIPY >> citation police will find you. ;) >> >> DIPY 1.1.1 (Friday, 10 January 2020) >> >> This release received contributions from 11 developers (the full release >> notes are at: >> https://dipy.org/documentation/1.1.1./release_notes/release1.1/). Thank >> you all for your contributions and feedback! >> >> Please click here >> to check API changes. >> >> Highlights of this release include: >> >> - >> >> New module for deep learning DIPY.NN (uses TensorFlow 2.0). >> - >> >> Improved DKI performance and increased utilities. >> - >> >> Non-linear and RESTORE fits from DTI compatible now with DKI. >> - >> >> Numerical solutions for estimating axial, radial and mean kurtosis. >> - >> >> Added Kurtosis Fractional Anisotropy by Glenn et al. 2015. >> - >> >> Added Mean Kurtosis Tensor by Hansen et al. 2013. >> - >> >> Nibabel minimum version is 3.0.0. >> - >> >> Azure CI added and Appveyor CI removed. >> - >> >> New command line interfaces for LPCA, MPPCA and Gibbs Unringing. >> - >> >> New MTMS CSD tutorial added. >> - >> >> Horizon refactored and updated to support StatefulTractograms. >> - >> >> Speeded up all cython modules by using a smarter configuration >> setting. >> - >> >> All tutorials updated to API changes and 2 new tutorials added. >> - >> >> Large documentation update. >> - >> >> Closed 126 issues and merged 50 pull requests. >> >> Note: >> >> - >> >> Have in mind that DIPY stopped supporting Python 2 after version >> 0.16.0. All major Python projects have switched to Python 3. It is time >> that you switch too. >> >> >> >> To upgrade or install DIPY >> >> Run the following command in your terminal: >> >> >> pip install --upgrade dipy >> >> or >> >> conda install -c conda-forge dipy >> >> This version of DIPY depends on nibabel (3.0.0+). >> >> For visualization you need FURY (0.4.0+). >> >> Questions or suggestions? >> >> >> >> For any questions go to http://dipy.org, or send an e-mail to >> dipy at python.org >> >> We also have an instant messaging service and chat room available at >> https://gitter.im/nipy/dipy >> >> On behalf of the DIPY developers, >> >> Eleftherios Garyfallidis, Ariel Rokem, Serge Koudoro >> >> https://dipy.org/contributors >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From madhura.cj at gmail.com Thu Jan 23 00:31:39 2020 From: madhura.cj at gmail.com (Madhura Jayaratne) Date: Thu, 23 Jan 2020 16:31:39 +1100 Subject: [scikit-learn] Assigning reviewers to PRs Message-ID: Hi Scikit-learn team, I have submitted a PR implementing support for ICE plots at [1]. There I noticed that Github suggests reviewers based on the recent edits and reviews to the files changed in the PR. I am wondering what is the preferred practice for assigning reviewers. 
(Initially, I assigned Guillaume Lemaitre without thinking much, but he cannot review at the moment) I am wondering, should I assign someone from the suggested list or should I wait for the core developers to assign themselves? [1] https://github.com/scikit-learn/scikit-learn/pull/16164 -- Thanks and Regards, Madhura Jayaratne -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Thu Jan 23 11:41:42 2020 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 23 Jan 2020 17:41:42 +0100 Subject: [scikit-learn] Assigning reviewers to PRs In-Reply-To: References: Message-ID: Hi Madhura, We don't really follow the assignment workflow. Whichever of the maintainers is free and feels like reviewing the PR will do so. But in general review time is our main bottleneck, and it may take some time for us to get on a PR. Patience and every now and then writing a comment under your PR saying "a gentle ping" or something would probably remind us that it's still not reviewed :) Thanks for contributing, Adrin. On Thu, Jan 23, 2020 at 6:33 AM Madhura Jayaratne wrote: > Hi Scikit-learn team, > > I have submitted a PR implementing support for ICE plots at [1]. There I > noticed that Github suggests reviewers based on the recent edits and > reviews to the files changed in the PR. I am wondering what is the > preferred practice for assigning reviewers. (Initially, I assigned Guillaume > Lemaitre without thinking much, but he cannot review at the moment) I am > wondering, should I assign someone from the suggested list or should I > wait for the core developers to assign themselves? > > [1] https://github.com/scikit-learn/scikit-learn/pull/16164 > > -- > Thanks and Regards, > > Madhura Jayaratne > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From madhura.cj at gmail.com Thu Jan 23 17:20:59 2020 From: madhura.cj at gmail.com (Madhura Jayaratne) Date: Fri, 24 Jan 2020 09:20:59 +1100 Subject: [scikit-learn] Assigning reviewers to PRs In-Reply-To: References: Message-ID: Thanks for the clarification. On Fri, Jan 24, 2020 at 3:42 AM Adrin wrote: > Hi Madhura, > > We don't really follow the assignment workflow. Whichever of the > maintainers is free and feels like reviewing the PR > will do so. But in general review time is our main bottleneck, and it may > take some time for us to get on a PR. > > Patience and every now and then writing a comment under your PR saying "a > gentle ping" or something > would probably remind us that it's still not reviewed :) > > Thanks for contributing, > Adrin. > > On Thu, Jan 23, 2020 at 6:33 AM Madhura Jayaratne > wrote: > >> Hi Scikit-learn team, >> >> I have submitted a PR implementing support for ICE plots at [1]. There I >> noticed that Github suggests reviewers based on the recent edits and >> reviews to the files changed in the PR. I am wondering what is the >> preferred practice for assigning reviewers. (Initially, I assigned Guillaume >> Lemaitre without thinking much, but he cannot review at the moment) I am >> wondering, should I assign someone from the suggested list or should I >> wait for the core developers to assign themselves? 
>> >> [1] https://github.com/scikit-learn/scikit-learn/pull/16164 >> >> -- >> Thanks and Regards, >> >> Madhura Jayaratne >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Thanks and Regards, Madhura Jayaratne -------------- next part -------------- An HTML attachment was scrubbed... URL: From pehlivaniancharles at gmail.com Sat Jan 25 10:52:36 2020 From: pehlivaniancharles at gmail.com (Charles Pehlivanian) Date: Sat, 25 Jan 2020 10:52:36 -0500 Subject: [scikit-learn] Why is subset invariance necessary for transfom()? In-Reply-To: References: <9f6eaca1-dc90-6683-f6dc-cb312509e41d@gmail.com> Message-ID: To summarize - for mds, spectralembedding, it looks like there is no transform method that will satisfy both 1. fit(X).transform(X) == fit_transform(X) 2. transform(X)[i:i+1] == transform(X[i:i+1]) that's because the current fit_transform doesn't factor nicely into those 2 steps. The last step returns a subset of eigenvalues of a modified Gram matrix. For PCA, kernel PCA, LLE, fit_transform is something like: center data, do U,S,V = SVD, project data onto submatrix of V. The last step is matrix multiplication. Last step in transform methods there are np.dot(...). That factors nicely. There could be a transform_batch method for mds which would satisfy 1., then transform could call transform_batch rowwise, to satisfy 2, but no single method will work. I don't know if there is appetite for separation and modification of unittests involved. Charles On Tue, Jan 21, 2020 at 9:19 PM Charles Pehlivanian < pehlivaniancharles at gmail.com> wrote: > This is what I thought we usually do. It looks like you said we are > doing a greedy transform. > I'm not sure I follow that. In particular for spectral embedding for > example there is a pretty way to describe > the transform and that's what we're doing. You could also look at > doing transductive learning but that's > not really the standard formulation, is it? > > Batch transform becomes greedy if one does: > > for x_i in X: > X_new_i = self.transform(x_i) > > I said that LLE uses greedy algorithm. The algorithm implemented is > pointwise. It may be that that's the only approach (in which case it's not > greedy), but I don't think so - looks like all of the spectral embedding, > lle, mds transforms have batch versions. So I probably shouldn't call it > greedy. Taking a *true* batch transform and enclosing it in a loop like > that - I'm calling that greedy. I'm honestly not sure if the LLE qualifies. > > Spectral embedding - agree, the method you refer to is implemented in > fit_transform(). How to apply to oos points? > > Non-distributable, non-subset-invariant, optimal batch transform > Can you give an example of that? > > Most of the manifold learners can be expressed as solutions to > eigenvalue/vector problems. For MDS batch transform, form a new constrained > double-centered distance matrix and solve a constrained least-squares > problem that mimics the SVD solution to the eigenvalue problem. They're > all like this - least-squares estimates for some constrained eigenvalue > problem. The question is whether you want to solve the full problem, or > solve on each point, adding one row and optimzing each time, ... that would > be subset-invariant though. 
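Coming back to the transform_batch idea above, the two verbs could coexist
roughly along these lines. This is a purely hypothetical sketch, not an
existing scikit-learn API:

    import numpy as np

    class TwoVerbManifoldMixin:
        """Hypothetical sketch -- not an existing scikit-learn API."""

        def transform_batch(self, X):
            # Joint, non-subset-invariant embedding of all of X at once;
            # the idea is that fit(X).transform_batch(X) could reproduce
            # fit_transform(X) (condition 1 above).
            raise NotImplementedError

        def transform(self, X):
            # Row-by-row wrapper around the batch verb; this restores subset
            # invariance (condition 2) at the cost of embedding each new
            # sample independently of the others.
            return np.vstack([self.transform_batch(X[i:i + 1])
                              for i in range(X.shape[0])])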
> > For this offline/batch approach to an oos transform, the only way I see to > make it pass tests is to enclose it in a loop as above. That's what I see > at least. > > > On Tue, Jan 21, 2020 at 8:35 PM Andreas Mueller wrote: > >> >> >> On 1/21/20 8:23 PM, Charles Pehlivanian wrote: >> >> I understand - I'm kind of conflating the idea of data sample with test set, my view assumes there are a sample space of samples, might require rethinking the cross-validation setup... >> >> I also think that part of it relies on the notion of online vs. offline algorithm. For offline fits, a batch transform (non-subset invariant) is preferred. For a transformer that can only be used in an online sense, or is primarily used that way, keep the invariant. >> >> >> I see 3 options here - all I can say is that I don't vote for the first >> >> + No transform method on the manifold learners, so no cross-validation >> >> This is what I thought we usually do. It looks like you said we are doing >> a greedy transform. >> I'm not sure I follow that. In particular for spectral embedding for >> example there is a pretty way to describe >> the transform and that's what we're doing. You could also look at doing >> transductive learning but that's >> not really the standard formulation, is it? >> >> + Pointwise, distributable, subset-invariant, suboptimal greedy transform >> >> + Non-distributable, non-subset-invariant, optimal batch transform >> >> Can you give an example of that? >> >> -Charles >> >> On Mon., Jan. 20, 21:24:52 2020 > wrote >> >> I think allowing subset invariance to not hold is making stronger >> >> assumptions than we usually do about what it means to have a "test set". >> Having a transformation like this that relies on test set statistics >> implies that the test set is more than just selected samples, but rather >> that a large collection of samples is available at one time, and that it is >> in some sense sufficient or complete (no more samples are available that >> would give a better fit). So in a predictive modelling context you might >> have to set up your cross validation splits with this in mind. >> >> In terms of API, the subset invariance constraint allows us to assume that >> the transformation can be distributed or parallelized over samples. I'm not >> sure whether we have exploited that assumption within scikit-learn or >> whether related projects do so. >> >> I see the benefit of using such transformations in a prediction Pipeline, >> and really appreciate this challenge to our assumptions of what "transform" >> means. >> >> Joel >> >> On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian, > wrote: >> >> >* Not all data transformers have a transform method. For those that do, >> *>* subset invariance is assumed as expressed >> *>* in check_methods_subset_invariance(). It must be the case that >> *>* T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic >> *>* projections - PCA, kernel PCA, etc., but not for some manifold learning >> *>* transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement >> *>* of the data in space is a constrained optimization, may take into account >> *>* the centroid of the dataset etc. >> *>>* The manifold learners have "batch" oos transform() methods that aren't >> *>* implemented, and wouldn't pass that test. 
Instead, those that do - >> *>* LocallyLinearEmbedding - use a pointwise version, essentially replacing a >> *>* batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]: >> *>>* for i in range(X.shape[0]): >> *>* X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i]) >> *>>* Where to implement the batch transform() methods for MDS, >> *>* SpectralEmbedding, LocallyLinearEmbedding, etc? >> *>>* Another verb? Both batch and pointwise versions? The latter is easy to >> *>* implement once the batch version exists. Relax the test conditions? >> *>* transform() is necessary for oos testing, so necessary for cross >> *>* validation. The batch versions should be preferred, although as it stands, >> *>* the pointwise versions are. >> *>>* Thanks >> *>* Charles Pehlivanian >> *>* _______________________________________________ >> *>* scikit-learn mailing list >> *>* scikit-learn at python.org >> *>* https://mail.python.org/mailman/listinfo/scikit-learn >> *>-------------- next part -------------- >> An HTML attachment was scrubbed... >> URL: >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pengyu.ut at gmail.com Mon Jan 27 15:30:44 2020 From: pengyu.ut at gmail.com (Peng Yu) Date: Mon, 27 Jan 2020 14:30:44 -0600 Subject: [scikit-learn] What are the stopwords used by CountVectorizer? In-Reply-To: References: Message-ID: Hi, I don't see what stopwords are used by CountVectorizer with stop_wordsstring = ?english?. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html Is there a way to figure it out? Thanks. -- Regards, Peng From jonathan.cusick09 at gmail.com Mon Jan 27 15:53:08 2020 From: jonathan.cusick09 at gmail.com (Jonathan Cusick) Date: Mon, 27 Jan 2020 15:53:08 -0500 Subject: [scikit-learn] What are the stopwords used by CountVectorizer? In-Reply-To: References: Message-ID: Hi Peng, I believe the set of English stop words used across all token vectorizers can be found in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py. Cheers, Jon On Mon, Jan 27, 2020 at 3:33 PM Peng Yu wrote: > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop_wordsstring = ?english?. > > > https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html > > Is there a way to figure it out? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.braune79 at gmail.com Mon Jan 27 15:55:15 2020 From: christian.braune79 at gmail.com (Christian Braune) Date: Mon, 27 Jan 2020 21:55:15 +0100 Subject: [scikit-learn] What are the stopwords used by CountVectorizer? In-Reply-To: References: Message-ID: Hi, https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/feature_extraction/_stop_words.py Regards Christian Peng Yu schrieb am Mo., 27. Jan. 
2020, 21:31: > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop_wordsstring = ?english?. > > > https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html > > Is there a way to figure it out? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Mon Jan 27 15:38:56 2020 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 27 Jan 2020 14:38:56 -0600 Subject: [scikit-learn] What are the stopwords used by CountVectorizer? In-Reply-To: References: Message-ID: <0806FEF7-C7D3-4D31-9EDD-8E349252D8AE@sebastianraschka.com> Hi Peng, check out https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py Best, Sebastian > On Jan 27, 2020, at 2:30 PM, Peng Yu wrote: > > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop_wordsstring = ?english?. > > https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html > > Is there a way to figure it out? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Mon Jan 27 17:32:58 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 28 Jan 2020 09:32:58 +1100 Subject: [scikit-learn] What are the stopwords used by CountVectorizer? In-Reply-To: <0806FEF7-C7D3-4D31-9EDD-8E349252D8AE@sebastianraschka.com> References: <0806FEF7-C7D3-4D31-9EDD-8E349252D8AE@sebastianraschka.com> Message-ID: See also https://www.aclweb.org/anthology/W18-2502/ for a critique of this and other stop word lists. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pengyu.ut at gmail.com Tue Jan 28 02:00:20 2020 From: pengyu.ut at gmail.com (Peng Yu) Date: Tue, 28 Jan 2020 01:00:20 -0600 Subject: [scikit-learn] Memory efficient TfidfVectorizer Message-ID: Hi, To use TfidfVectorizer, the whole corpus must be used into memory. This can be a problem for machines without a lot of memory. Is there a way to use only a small amount of memory by saving most intermediate results in the disk? Thanks. -- Regards, Peng From joel.nothman at gmail.com Tue Jan 28 05:19:47 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 28 Jan 2020 21:19:47 +1100 Subject: [scikit-learn] Memory efficient TfidfVectorizer In-Reply-To: References: Message-ID: Are you concerned about storing the whole corpus text in memory, or the whole corpus' statistics? If the text, use input='file' or input='filename' (or a generator of texts). On Tue, 28 Jan 2020 at 18:01, Peng Yu wrote: > Hi, > > To use TfidfVectorizer, the whole corpus must be used into memory. > This can be a problem for machines without a lot of memory. Is there a > way to use only a small amount of memory by saving most intermediate > results in the disk? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
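For instance, a minimal sketch of the lazy-reading route, plus the usual
trick when it is the vocabulary statistics rather than the raw text that
exhaust memory. The file paths and n_features below are placeholders:

    from sklearn.feature_extraction.text import (
        TfidfVectorizer, HashingVectorizer, TfidfTransformer)

    # Read documents lazily from disk instead of holding every raw text
    # in memory at once
    filenames = ["corpus/doc0001.txt", "corpus/doc0002.txt"]

    tfidf = TfidfVectorizer(input="filename", ngram_range=(1, 3))
    X = tfidf.fit_transform(filenames)            # sparse CSR matrix of tf-idf weights

    # If the vocabulary itself blows up memory (likely with n-grams up to 3),
    # a fixed-size hashing vectorizer bounds it
    hasher = HashingVectorizer(input="filename", ngram_range=(1, 3),
                               n_features=2 ** 20, alternate_sign=False, norm=None)
    X_hashed = TfidfTransformer().fit_transform(hasher.transform(filenames))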
URL: From pengyu.ut at gmail.com Tue Jan 28 06:26:34 2020 From: pengyu.ut at gmail.com (Peng Yu) Date: Tue, 28 Jan 2020 05:26:34 -0600 Subject: [scikit-learn] Memory efficient TfidfVectorizer In-Reply-To: References: Message-ID: > Are you concerned about storing the whole corpus text in memory, or the > whole corpus' statistics? If the text, use input='file' or input='filename' > (or a generator of texts). I am not really sure which stage takes the most memory as my program kills itself due to memory limitation. But I suspect it is the latter (whole corpus statistics) that takes the most memory? (I used 1<=ngram<=3). -- Regards, Peng From christopher.samiullah at protonmail.com Tue Jan 28 07:01:43 2020 From: christopher.samiullah at protonmail.com (Christopher.samiullah) Date: Tue, 28 Jan 2020 12:01:43 +0000 Subject: [scikit-learn] Recommended way of distributing persisted models so they work on different architectures In-Reply-To: <4JCxUGX0W0uwEDGJy0qyP3QyHpJy16dllG1DW2lAbNlP8Z4j7iV0XEAyTzdJp0-5OgSf3vjpxbcaWyCNnerydijNG1GGTXP9liLrtvXNLuw=@protonmail.com> References: <4JCxUGX0W0uwEDGJy0qyP3QyHpJy16dllG1DW2lAbNlP8Z4j7iV0XEAyTzdJp0-5OgSf3vjpxbcaWyCNnerydijNG1GGTXP9liLrtvXNLuw=@protonmail.com> Message-ID: Dear admins, > I recently encountered an issue attempting to load a model persisted via joblib dump on different Python architectures. I wrote up the issue here on stackoverflow: > https://stackoverflow.com/questions/59927368/how-to-distribute-sklearn-models-so-that-they-work-on-different-architectures?noredirect=1#59927368 > > I wondered if there was a recommended approach to mitigate this issue? I see there is sklearn-onyx (https://github.com/onnx/sklearn-onnx) is that something you would advise? > > Any help would be greatly appreciated. > > Kind regards, > Chris p.s. Apologies if this is a resend, I wasn't signed up to the list when I sent previously. > Sent with [ProtonMail](https://protonmail.com) Secure Email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Jan 28 07:12:05 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 28 Jan 2020 23:12:05 +1100 Subject: [scikit-learn] Recommended way of distributing persisted models so they work on different architectures In-Reply-To: References: <4JCxUGX0W0uwEDGJy0qyP3QyHpJy16dllG1DW2lAbNlP8Z4j7iV0XEAyTzdJp0-5OgSf3vjpxbcaWyCNnerydijNG1GGTXP9liLrtvXNLuw=@protonmail.com> Message-ID: Yes, ONNX is an appropriate solution when exporting models for prediction. See http://scikit-learn.org/stable/modules/model_persistence.html On Tue, 28 Jan 2020 at 23:03, Christopher.samiullah via scikit-learn < scikit-learn at python.org> wrote: > Dear admins, > > > I recently encountered an issue attempting to load a model persisted via > joblib dump on different Python architectures. I wrote up the issue here on > stackoverflow: > > https://stackoverflow.com/questions/59927368/how-to-distribute-sklearn-models-so-that-they-work-on-different-architectures?noredirect=1#59927368 > > I wondered if there was a recommended approach to mitigate this issue? I > see there is sklearn-onyx (https://github.com/onnx/sklearn-onnx) is that > something you would advise? > > Any help would be greatly appreciated. > > Kind regards, > Chris > > p.s. Apologies if this is a resend, I wasn't signed up to the list when I > sent previously. > > > > Sent with ProtonMail Secure Email. 
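For example, a minimal sketch of the ONNX route, assuming the skl2onnx and
onnxruntime packages are installed; the model, file name and input name
below are only illustrative:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    import onnxruntime as rt

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

    # Export once, in an architecture-neutral format
    onx = convert_sklearn(
        clf, initial_types=[("input", FloatTensorType([None, X.shape[1]]))])
    with open("model.onnx", "wb") as f:
        f.write(onx.SerializeToString())

    # Load and predict on any platform that has onnxruntime, independently
    # of the Python/scikit-learn versions used for training
    sess = rt.InferenceSession("model.onnx")
    labels = sess.run(None, {"input": X[:5].astype(np.float32)})[0]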
> > > _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> -------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pengyu.ut at gmail.com Tue Jan 28 09:47:13 2020
From: pengyu.ut at gmail.com (Peng Yu)
Date: Tue, 28 Jan 2020 08:47:13 -0600
Subject: [scikit-learn] How to make sure stop words are matched when lowercase=False?
Message-ID: 

Hi,

https://github.com/scikit-learn/scikit-learn/blob/002f891a33b612be389d9c488699db5689753ef4/sklearn/feature_extraction/text.py#L587

The default of lowercase is True, but the stop words are lower case.
Where is the code that makes sure the stop words are still removed when
the tokens are not in lower case?

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py

--
Regards,
Peng

From joel.nothman at gmail.com Tue Jan 28 17:39:38 2020
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 29 Jan 2020 09:39:38 +1100
Subject: [scikit-learn] How to make sure stop words are matched when lowercase=False?
In-Reply-To: 
References: 
Message-ID: 

There is no such code. You need to make sure that the normalisation you
use matches the normalisation applied when constructing a stop word list.
Unfortunately we do not provide for this directly, and it is not easy to
do so in the general case.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pengyu.ut at gmail.com Tue Jan 28 20:51:48 2020
From: pengyu.ut at gmail.com (Peng Yu)
Date: Tue, 28 Jan 2020 19:51:48 -0600
Subject: [scikit-learn] Which sparse matrix should be use for fit?
Message-ID: 

https://scikit-learn.org/stable/modules/svm.html

Of the svm classes mentioned above, which sparse matrices are appropriate
to use with them?

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix

It is not very clear what matrix operations are used in fit(), so I cannot
tell which sparse matrix format should be used. Thanks.

--
Regards,
Peng

From pengyu.ut at gmail.com Wed Jan 29 04:14:43 2020
From: pengyu.ut at gmail.com (Peng Yu)
Date: Wed, 29 Jan 2020 03:14:43 -0600
Subject: [scikit-learn] Incremental generation of tf-idf matrix
Message-ID: 

Hi,

It seems that even if there is only a slight change in the corpus, I have
to run TfidfVectorizer on the whole corpus again. This can be
time-consuming, especially for large corpora. Is there a way to generate
the tf-idf matrix incrementally, so that a slight change in the corpus
only takes a little time to process instead of a full recomputation?
Thanks.

--
Regards,
Peng

From g.lemaitre58 at gmail.com Wed Jan 29 04:58:35 2020
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Wed, 29 Jan 2020 10:58:35 +0100
Subject: [scikit-learn] Which sparse matrix should be use for fit?
In-Reply-To: 
References: 
Message-ID: 

Looking at check_array in the SVR and SVC, we convert to CSR format if the
sparse matrices are not already in this format:
https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/svm/_base.py#L146

Basically, this is more efficient because we are going to perform
operations that access rows. In scikit-learn, most predictors expect CSR,
apart from the tree-based models, where CSC is more efficient. CSC is also
the format that works better for the preprocessing estimators (in
general). Be aware that we will convert to the appropriate format if
required.
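For example, a small sketch of what that conversion means in practice, with
LinearSVC standing in for any of the SVM estimators and synthetic data in
place of a real corpus:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X = rng.rand(200, 50)
    X[X < 0.9] = 0.0                      # mostly zeros, so sparse storage pays off
    y = rng.randint(2, size=200)

    X_csr = sp.csr_matrix(X)              # row-oriented, used as-is by the SVMs
    X_csc = sp.csc_matrix(X)              # column-oriented

    clf = LinearSVC(max_iter=5000).fit(X_csr, y)   # used directly
    clf = LinearSVC(max_iter=5000).fit(X_csc, y)   # also works; converted to CSR internally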
On Wed, 29 Jan 2020 at 02:54, Peng Yu wrote: > https://scikit-learn.org/stable/modules/svm.html > > Of the svm classes mentioned above, which sparse matrixes are > appropriate to be used with them? > > > https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix > > It is not very clear what matrix operations are used in fit(), so I > can not tell what sparse matrixes should be used. Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed Jan 29 04:59:30 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 29 Jan 2020 10:59:30 +0100 Subject: [scikit-learn] Which sparse matrix should be use for fit? In-Reply-To: References: Message-ID: if you could open an issue on GitHub, it would be great because this info would be useful in the docstring. On Wed, 29 Jan 2020 at 10:58, Guillaume Lema?tre wrote: > Looking at check_array in the SVR and SVC, we convert to CSR format if the > sparse matrices are not from this format: > > https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/svm/_base.py#L146 > > Basically, this is more efficient because we are going to make operation > which will get row., > > In scikit-learn most predictor expect CSR apart of tree-based where CSC > will be more efficient. CSC is also the format > which is better for the preprocessing estimator (in general). Be aware > that we are going to convert to the appropriate > format if required. > > On Wed, 29 Jan 2020 at 02:54, Peng Yu wrote: > >> https://scikit-learn.org/stable/modules/svm.html >> >> Of the svm classes mentioned above, which sparse matrixes are >> appropriate to be used with them? >> >> >> https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix >> >> It is not very clear what matrix operations are used in fit(), so I >> can not tell what sparse matrixes should be used. Thanks. >> >> -- >> Regards, >> Peng >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > Guillaume Lemaitre > Scikit-learn @ Inria Foundation > https://glemaitre.github.io/ > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Jan 30 06:03:57 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Thu, 30 Jan 2020 12:03:57 +0100 Subject: [scikit-learn] SLEP011: Change of governance Message-ID: Dear all, I would like to propose a change of governance in the decision process to make it possible to retract a SLEP and to not escalate a mandatory TC vote. I open a SLEP where we can discuss about it: https://github.com/scikit-learn/enhancement_proposals/pull/28/files Cheers, -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From sina.mansour.lakouraj at gmail.com Thu Jan 30 19:09:00 2020 From: sina.mansour.lakouraj at gmail.com (Sina Mansour L.) 
Date: Fri, 31 Jan 2020 11:09:00 +1100
Subject: [scikit-learn] make_spd_matrix documentation
Message-ID: 

Hi,

I was trying to use the random positive definite matrix generator
implemented in sklearn (
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_spd_matrix.html).
However, I noticed that the documentation is very minimal, with no
references to related research articles. I would like to know whether it
generates a uniformly random covariance matrix, so that it could be used
as a sampling method to generate a null distribution for covariances. I
looked at the source code, but I did not find any explanation of the
method used. In other words, I want to know whether the random symmetric
positive definite matrix returned is uniformly sampled from the space of
all positive definite matrices or not. This is the idea I was interested
in when I came across make_spd_matrix:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.594.3009&rep=rep1&type=pdf

Kind regards,
Sina
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
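Short of a documented reference, one way to probe what make_spd_matrix
returns is to inspect its output empirically. A minimal sketch, which
checks symmetry and positive definiteness and looks at the eigenvalue
spread over many draws, but does not by itself settle whether the sampling
is uniform over SPD matrices:

    import numpy as np
    from sklearn.datasets import make_spd_matrix

    M = make_spd_matrix(n_dim=5, random_state=0)
    print(np.allclose(M, M.T))                      # symmetric
    print(np.linalg.eigvalsh(M).min() > 0)          # strictly positive eigenvalues

    # Look at the spread of eigenvalues over many draws to see whether the
    # sampling behaviour matches the null distribution you have in mind
    eigs = np.array([np.linalg.eigvalsh(make_spd_matrix(n_dim=5, random_state=i))
                     for i in range(1000)])
    print(eigs.min(), eigs.max())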