From szx9404 at gmail.com Tue Dec 4 20:14:08 2018 From: szx9404 at gmail.com (parker x) Date: Tue, 4 Dec 2018 17:14:08 -0800 Subject: [scikit-learn] Question about contributing to scikit-learn Message-ID: Dear scikit-learn developers, My name is Parker, and I'm a data scientist. Scikit-learn is a great ML library that I use frequently for work and personal projects. I have always wanted to contribute something to the scikit-learn community, and I am wondering if you could give some opinions on the following two ideas for contribution. My first idea is to integrate another Python library, 'imbalanced-learn', into scikit-learn so that people could also use scikit-learn to deal with class imbalance issues. Another idea is to combine the scikit-learn built-in feature selection functions into one automated feature selection function that might benefit users who are not familiar with the feature selection process. Looking forward to your suggestions! And thank you very much for your time! Best, Parker -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Dec 5 17:32:06 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 6 Dec 2018 09:32:06 +1100 Subject: [scikit-learn] New core dev: Adrin Jalali Message-ID: The Scikit-learn core development team has welcomed a new member, Adrin Jalali, who has been doing some really amazing work in contributing code and reviews since July (aside from occasional contributions since 2014). Congratulations and welcome, Adrin! -------------- next part -------------- An HTML attachment was scrubbed...
URL: From matthieu.brucher at gmail.com Wed Dec 5 17:45:19 2018 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 5 Dec 2018 22:45:19 +0000 Subject: [scikit-learn] Recurrent questions about speed for TfidfVectorizer In-Reply-To: References: <46dd3561-a70c-ea18-282f-26d34b87cf06@gmail.com> Message-ID: Hi all, Sorry for the late reply, lots of things to work on currently. I'll have a look at the roadmap and the pointers to see what could be done to enhance the situation. Cheers, Matthieu On Mon, 26 Nov 2018 at 20:09, Roman Yurchak via scikit-learn < scikit-learn at python.org> wrote: > Tries are interesting, but it appears that while they use less memory > than dicts/maps they are generally slower than dicts for a large number > of elements. See e.g. > https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. > This is also consistent with the results in the below linked > CountVectorizer PR that aimed to use tries, I think. > > Though maybe e.g. MARISA-Trie (and generally the trie libraries available in > Python) did improve significantly in the 5 years since > https://github.com/scikit-learn/scikit-learn/issues/2639 was done. > > The thing is also that even HashingVectorizer, which doesn't need to handle > the vocabulary, is only moderately faster, so using a better data > structure for the vocabulary might give us its performance at best. > > -- > Roman > > On 26/11/2018 16:28, Andreas Mueller wrote: > > I think tries might be an interesting datastructure, but it really > > depends on where the bottleneck is. > > I'm really surprised they are not used more, but maybe that's just > > because implementations are missing?
> > > > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: > >> Hi Matthieu, > >> > >> if you are interested in general questions regarding improving > >> scikit-learn performance, you might want to have a look at the draft > >> roadmap > >> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- > >> there are a lot of topics where suggestions / PRs on improving performance > >> would be very welcome. > >> > >> For the particular case of TfidfVectorizer, it is a bit different from > >> the rest of the scikit-learn code base in the sense that it's not > >> limited by the performance of numerical calculation but rather that of > >> string processing and counting. TfidfVectorizer is equivalent to > >> CountVectorizer + TfidfTransformer, and the latter has only a marginal > >> computational cost. As to CountVectorizer, last time I checked, its > >> profiling was something along the lines of, > >> - part regexp for tokenization (see token_pattern.findall) > >> - part token counting (see CountVectorizer._count_vocab) > >> - and a comparable part for all the rest > >> > >> Because of that, porting it to Cython is not that immediate, as one is > >> still going to use CPython regexp and token counting in a dict. For > >> instance, HashingVectorizer implements token counting in Cython -- it's > >> faster but not that much faster. Using C++ maps or some less common > >> structures has been discussed in > >> https://github.com/scikit-learn/scikit-learn/issues/2639 > >> > >> Currently, I think, there are ~3 main ways performance could be improved, > >> 1. Optimize the current implementation while remaining in Python. > >> Possible but IMO would require some effort, because there is not much > >> low-hanging fruit left there. Though a new look would definitely be good. > >> > >> 2. Parallelize computations.
There was some earlier discussion about > >> this in scikit-learn issues, but at present, the best way would > >> probably be to add it in dask-ml (see > >> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already > >> supported. Someone would need to implement CountVectorizer. > >> > >> 3. Rewrite part of the implementation in a lower-level language (e.g. > >> Cython). The question is how maintainable that would be, and whether the > >> performance gains would be worth it. Now that Python 2 will be dropped, > >> at least not having to deal with Py2/3 compatibility for strings in > >> Cython might make things a bit easier. Though, if the processing is in > >> Cython, it might also make using custom tokenizers/analyzers more difficult. > >> > >> On a related topic, I have been experimenting with implementing part > >> of this processing in Rust lately: > >> https://github.com/rth/text-vectorize. So far it looks promising. > >> Though, of course, it will remain a separate project because of language > >> constraints in scikit-learn. > >> > >> In general if you have thoughts on things that can be improved, don't > >> hesitate to open issues, > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Quantitative analyst, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Dec 5 22:13:03 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 6 Dec 2018 11:13:03 +0800 Subject: [scikit-learn] Is there regression algo with 3-d input?
Message-ID: I want to do time-series regression per week, so the unit of the training data X is the day, e.g. Mon, Tue, Wed, etc. For example, the training data X is like below: X: [ [1,2,3,4,3,2,1] ,[2,2,3,4,3,2,2] ] Each value in a row corresponds to one day of the week, so each row has 7 values. Now suppose I have another feature W for each day, like weather or traffic. I thought expanding X to 3-d is reasonable because W should be attached to each day in X. So what I thought X is: [ [ [1, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , [2, W-Sat] , [1, W-Sun] ] , [ [2, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , [2, W-Sat] , [2, W-Sun] ] ] It becomes a 3-d input and contains every feature of each day. Does scikit-learn have a regression algorithm that can accept a 3-d input X? Almost all the algorithms I found accept only 2-d input X, e.g.: *X* : array-like or sparse matrix, shape = [n_samples, n_features] -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Wed Dec 5 23:50:32 2018 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Wed, 5 Dec 2018 20:50:32 -0800 Subject: [scikit-learn] Is there regression algo with 3-d input? In-Reply-To: References: Message-ID: Would the output be different if you simply wrapped the whole process with reshaping the 3-d input to 2-d? On Wed, Dec 5, 2018 at 7:14 PM lampahome wrote: > I want to do time-series regression per week, so the unit of the training > data X is the day, e.g. Mon, Tue, Wed, etc. > > For example, the training data X is like below: > X: > [ [1,2,3,4,3,2,1] > ,[2,2,3,4,3,2,2] ] > Each value in a row corresponds to one day of the week, so each row has 7 values. > > Now suppose I have another feature W for each day, like weather or > traffic. > > I thought expanding X to 3-d is reasonable because W should be > attached to each day in X.
> > So what I thought X is: > [ [ [1, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , > [2, W-Sat] , [1, W-Sun] ] > , [ [2, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , > [2, W-Sat] , [2, W-Sun] ] ] > It becomes a 3-d input and contains every feature of each day. > > Does scikit-learn have a regression algorithm that can accept a 3-d input X? > Almost all the algorithms I found accept only 2-d input X, e.g.: *X* : array-like or > sparse matrix, shape = [n_samples, n_features] > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Dec 5 23:54:34 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 6 Dec 2018 12:54:34 +0800 Subject: [scikit-learn] Is there regression algo with 3-d input? In-Reply-To: References: Message-ID: Stuart Reynolds wrote on Thu, 6 Dec 2018 at 12:52: > Would the output be different if you simply wrapped the whole process with > reshaping the 3-d input to 2-d? > >> >> I don't know; I haven't experimented with it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Thu Dec 6 05:53:39 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 6 Dec 2018 11:53:39 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: References: Message-ID: Congrats and welcome Adrin! -- Olivier -------------- next part -------------- An HTML attachment was scrubbed...
URL: From sepand.haghighi at yahoo.com Thu Dec 6 12:59:16 2018 From: sepand.haghighi at yahoo.com (Sepand Haghighi) Date: Thu, 6 Dec 2018 17:59:16 +0000 (UTC) Subject: [scikit-learn] PyCM 1.6 released: New machine learning library for confusion matrix statistical analysis References: <124384675.2914980.1544119156218.ref@mail.yahoo.com> Message-ID: <124384675.2914980.1544119156218@mail.yahoo.com> Hi folks, Recently we released a new version of PyCM, a library for confusion matrix statistical analysis. I thought you might find it interesting. PyCM is a multi-class confusion matrix library written in Python that supports both input data vectors and direct matrix input, and a proper tool for post-classification model evaluation that supports most class and overall statistics parameters. PyCM is the Swiss-army knife of confusion matrices, targeted mainly at data scientists who need a broad array of metrics (more than 90) for predictive models and an accurate evaluation of a large variety of classifiers. Version 1.6 changelog: - AUC Value Interpretation (AUCI) added - Example 6 added (unbalanced data) - Anaconda Cloud package added - overall_param and class_param arguments added to stat, save_stat and save_html methods - class_param argument added to save_csv method - _ removed from overall statistics names - README modified - Document modified Repository: https://github.com/sepandhaghighi/pycm Website: http://pycm.shaghighi.ir/ Document: http://pycm.shaghighi.ir/doc/ Paper link: PyCM: Multiclass confusion matrix library in Python Best regards, Sepand Haghighi -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Fri Dec 7 04:00:35 2018 From: pahome.chen at mirlab.org (lampahome) Date: Fri, 7 Dec 2018 17:00:35 +0800 Subject: [scikit-learn] Is there regression algo with 3-d input? In-Reply-To: References: Message-ID: Stuart Reynolds wrote on Thu, 6 Dec 2018 at 12:52:
> Would the output be different if you simply wrapped the whole process with > reshaping the 3-d input to 2-d? > > Sometimes it changes a lot, sometimes it is similar. Maybe a neural network is what I want? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Dec 8 05:15:23 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 8 Dec 2018 21:15:23 +1100 Subject: [scikit-learn] Question about contributing to scikit-learn In-Reply-To: References: Message-ID: Hi Parker, We strongly urge new contributors to start with small issues (documentation, small fixes, etc.) to gain confidence in the contribution procedure, etc. Once you've worked on small issues and understand better what comes through the issue tracker, you can consider bigger contributions. We have indeed proposed support for imblearn-like Pipeline extensions ( https://github.com/scikit-learn/scikit-learn/issues/3855#issuecomment-357949997). And yes, we're in need of a contributor there, but I would rather review and merge smaller pieces of your work before a large one that needs a lot of changes before merge. Joel On Wed, 5 Dec 2018 at 12:15, parker x wrote: > Dear scikit-learn developers, > > My name is Parker, and I'm a data scientist. > > Scikit-learn is a great ML library that I use frequently for work and > personal projects. I have always wanted to contribute something to the > scikit-learn community, and I am wondering if you could give some opinions > on the following two ideas for contribution. > > My first idea is to integrate another Python library, 'imbalanced-learn', > into scikit-learn so that people could also use scikit-learn to deal with > class imbalance issues. > > Another idea is to combine the scikit-learn built-in feature selection > functions into one automated feature selection function that might benefit > users who are not familiar with the feature selection process. > > Looking forward to your suggestions!
And thank you very much for your time! > > Best, > Parker > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Dec 8 09:26:15 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Sat, 8 Dec 2018 09:26:15 -0500 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: References: Message-ID: Congratulations and welcome Adrin! On 12/5/18 5:32 PM, Joel Nothman wrote: > The Scikit-learn core development team has welcomed a new member, > Adrin Jalali, who has been doing some really amazing work in > contributing code and reviews since July (aside from occasional > contributions since 2014). Congratulations and welcome, Adrin! > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Dec 8 12:16:05 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 8 Dec 2018 18:16:05 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: References: Message-ID: <20181208171605.lxyoztlfk56zalrp@phare.normalesup.org> Indeed, welcome Adrin, and thanks a lot for your investment in the package! Gaël On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote: > Congratulations and welcome Adrin! > On 12/5/18 5:32 PM, Joel Nothman wrote: > The Scikit-learn core development team has welcomed a new member, Adrin > Jalali, who has been doing some really amazing work in contributing code > and reviews since July (aside from occasional contributions since 2014). > Congratulations and welcome, Adrin!
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From jcrudy at gmail.com Sat Dec 8 13:16:46 2018 From: jcrudy at gmail.com (Jason Rudy) Date: Sat, 8 Dec 2018 10:16:46 -0800 Subject: [scikit-learn] check_estimator and score_samples method Message-ID: Hi all, I'm working on updating py-earth for some recent changes in scikit-learn and Cython. It seems like check_estimator has been significantly improved, and I'm working through making py-earth compliant with it. I've hit the following issue, though. It seems check_estimator tests score_samples using only X as an argument, and py-earth's score_samples requires y as well. So, my question is: must score_samples work with just X (and therefore maybe I should just remove it from py-earth), or is it okay to have a score_samples that requires y, in which case I should try to find a workaround for check_estimator? Best, Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: From qinhanmin2005 at sina.com Sat Dec 8 22:47:50 2018 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Sun, 09 Dec 2018 11:47:50 +0800 Subject: [scikit-learn] New core dev: Adrin Jalali Message-ID: <20181209034750.17491464009F@webmail.sinamail.sina.com.cn> Welcome and thanks for contributing! Hanmin Qin ----- Original Message ----- From: Gael Varoquaux To: Scikit-learn mailing list Subject: Re: [scikit-learn] New core dev: Adrin Jalali Date: 2018-12-09 01:18 Indeed, welcome Adrin, and thanks a lot for your investment in the package!
Gaël On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote: > Congratulations and welcome Adrin! > On 12/5/18 5:32 PM, Joel Nothman wrote: > The Scikit-learn core development team has welcomed a new member, Adrin > Jalali, who has been doing some really amazing work in contributing code > and reviews since July (aside from occasional contributions since 2014). > Congratulations and welcome, Adrin! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sun Dec 9 04:09:29 2018 From: g.lemaitre58 at gmail.com (Guillaume Lemaître) Date: Sun, 09 Dec 2018 10:09:29 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: <20181209034750.17491464009F@webmail.sinamail.sina.com.cn> Message-ID: <7bmvf8ethadfrfvqfkiadmcp.1544346569794@gmail.com> An HTML attachment was scrubbed... URL: From emmanuelarias30 at gmail.com Sun Dec 9 09:15:13 2018 From: emmanuelarias30 at gmail.com (eamanu15) Date: Sun, 9 Dec 2018 11:15:13 -0300 Subject: [scikit-learn] Question about contributing to scikit-learn In-Reply-To: References: Message-ID: Hello Parker, I can tell you my experience.
I started contributing to sklearn two months ago, and I started with code review; this way I could learn how sklearn is written and what the workflow is, read issues, and try to solve them. Then I made some PRs. I can tell you that the core devs are very friendly and always help you. In particular, I had the most contact with Joel Nothman and Andreas Mueller (thanks guys). So, I hope this helps you in some way =) Regards! Emmanuel -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Sun Dec 9 10:12:45 2018 From: adrin.jalali at gmail.com (Adrin) Date: Sun, 9 Dec 2018 16:12:45 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: <7bmvf8ethadfrfvqfkiadmcp.1544346569794@gmail.com> References: <20181209034750.17491464009F@webmail.sinamail.sina.com.cn> <7bmvf8ethadfrfvqfkiadmcp.1544346569794@gmail.com> Message-ID: Thank you all for all the support, patience, and mentorship you've given, and for now having me on board. It's an absolute pleasure working with you :) On Sun, 9 Dec 2018 at 10:10 Guillaume Lemaître wrote: > Congrats Adrin > > Sent from my phone - sorry to be brief and for potential misspellings. > *From:* qinhanmin2005 at sina.com > *Sent:* 9 December 2018 04:50 > *To:* scikit-learn at python.org > *Reply to:* qinhanmin2005 at sina.com; scikit-learn at python.org > *Subject:* Re: [scikit-learn] New core dev: Adrin Jalali > Welcome and thanks for contributing! > > Hanmin Qin > > ----- Original Message ----- > From: Gael Varoquaux > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] New core dev: Adrin Jalali > Date: 2018-12-09 01:18 > > > Indeed, welcome Adrin, and thanks a lot for your investment in the > package! > Gaël > On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote: > > Congratulations and welcome Adrin!
> > On 12/5/18 5:32 PM, Joel Nothman wrote: > > The Scikit-learn core development team has welcomed a new member, Adrin > > Jalali, who has been doing some really amazing work in contributing code > > and reviews since July (aside from occasional contributions since 2014). > > Congratulations and welcome, Adrin! > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 <+33169087968> > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jk231092 at gmail.com Mon Dec 10 00:02:46 2018 From: jk231092 at gmail.com (Jitesh Khandelwal) Date: Mon, 10 Dec 2018 10:32:46 +0530 Subject: [scikit-learn] Agglomerative clustering Message-ID: Hi everyone, I am using agglomerative clustering with an L1 distance matrix as input and the "complete" linkage option. I want to impose an additional constraint. When 2 clusters are combined and the cost of combination is equal for multiple cluster pairs, I want to choose the pair for which the combined cluster has the least size. What is the cleanest and easiest way of achieving this? Thanks, Jitesh -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Mon Dec 10 00:58:52 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 10 Dec 2018 06:58:52 +0100 Subject: [scikit-learn] Agglomerative clustering In-Reply-To: References: Message-ID: <20181210055852.tiyx7fa3eq4n277i@phare.normalesup.org> > I want to impose an additional constraint. When 2 clusters are combined and the > cost of combination is equal for multiple cluster pairs, I want to choose the > pair for which the combined cluster has the least size. > What is the cleanest and easiest way of achieving this? I don't think that the public API enables you to do that. So I think that you are going to have to modify the code, and modify the cost heapq to make it a tuple of "(distance, size)". Unfortunately, when doing this, you'll be on your own, as we cannot provide support for modified code. Cheers, Gaël From szx9404 at gmail.com Mon Dec 10 13:00:49 2018 From: szx9404 at gmail.com (parker x) Date: Mon, 10 Dec 2018 10:00:49 -0800 Subject: [scikit-learn] Question about contributing to scikit-learn In-Reply-To: References: Message-ID: Hi Emmanuel and Joel, Thanks very much for your advice. I will take a look at small issues first and see what to contribute from there. Best, Parker eamanu15 wrote on Sun, 9 Dec 2018 at 6:17: > Hello Parker, > > I can tell you my experience. > > I started contributing to sklearn two months ago, and I started with code > review; this way I could learn how sklearn is written and what the > workflow is, read issues, and try to solve them. Then I made some PRs. > > I can tell you that the core devs are very friendly and always help you. > In particular, I had the most contact with Joel Nothman and Andreas Mueller (thanks > guys). > > So, I hope this helps you in some way =) > > Regards!
> Emmanuel > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Dec 10 17:57:19 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 11 Dec 2018 09:57:19 +1100 Subject: [scikit-learn] check_estimator and score_samples method In-Reply-To: References: Message-ID: We're trying to make check_estimator more flexible ( https://github.com/scikit-learn/scikit-learn/pull/8022) but this is certainly not something we had considered yet. Perhaps suggest it there? Or for now we could just make the check pass if score_samples raises a TypeError when given only X... -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Tue Dec 11 04:09:40 2018 From: pahome.chen at mirlab.org (lampahome) Date: Tue, 11 Dec 2018 17:09:40 +0800 Subject: [scikit-learn] Why some regression algo can predict multiple out? Message-ID: As the title says, apart from sklearn.multioutput.MultiOutputRegressor, almost all regression algorithms in sklearn can only predict 1-d output. Ex: predicts 1-d output sklearn.linear_model.SGDRegressor fit(X, y, coef_init=None, intercept_init=None, sample_weight=None) y : numpy array, shape (n_samples,) Ex: predicts multiple outputs sklearn.linear_model.ElasticNet fit(X, y, check_input=True) y : ndarray, shape (n_samples,) or (n_samples, n_targets) There are two kinds of output for regression methods. What's the difference? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Dec 11 04:54:47 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 11 Dec 2018 20:54:47 +1100 Subject: [scikit-learn] Why some regression algo can predict multiple out?
In-Reply-To: References: Message-ID: Yes, some can use a shared model to predict multiple outputs (ElasticNet, DecisionTreeRegressor, MLPRegressor), others can't. Those that can't can be trivially extended to the multiple-output case with MultiOutputRegressor, by learning each output independently. On Tue, 11 Dec 2018 at 20:11, lampahome wrote: > As the title says, apart from sklearn.multioutput.MultiOutputRegressor, almost > all regression algorithms in sklearn can only predict 1-d output. > > Ex: predicts 1-d output > sklearn.linear_model.SGDRegressor > fit(X, y, coef_init=None, intercept_init=None, sample_weight=None) > y : numpy array, shape (n_samples,) > > Ex: predicts multiple outputs > sklearn.linear_model.ElasticNet > fit(X, y, check_input=True) > y : ndarray, shape (n_samples,) or (n_samples, n_targets) > > There are two kinds of output for regression methods. > > What's the difference? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Tue Dec 11 06:03:44 2018 From: pahome.chen at mirlab.org (lampahome) Date: Tue, 11 Dec 2018 19:03:44 +0800 Subject: [scikit-learn] Why some regression algo can predict multiple out? In-Reply-To: References: Message-ID: Joel Nothman wrote on Tue, 11 Dec 2018 at 17:56: > Yes, some can use a shared model to predict multiple outputs (ElasticNet, > DecisionTreeRegressor, MLPRegressor), others can't. Those that can't can be > trivially extended to the multiple-output case with MultiOutputRegressor, > by learning each output independently. > > I mean, why can those (ElasticNet, DecisionTreeRegressor, MLPRegressor) predict multiple outputs? What's the theory? Thanks a lot. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From pahome.chen at mirlab.org Wed Dec 12 21:40:31 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 13 Dec 2018 10:40:31 +0800 Subject: [scikit-learn] Difference between linear model and tree-based regressor? Message-ID: Linear models: linear regression, Lasso regression, Elastic Net regression, etc. Tree-based: extra-trees regressor, random forest regressor, etc. What's the difference between them? One point I observe is: 1. linear models can extrapolate, tree-based ones can't. Is that right? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Dec 13 04:16:28 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 13 Dec 2018 10:16:28 +0100 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> Message-ID: Hi all, I finally had some time to start looking at it in the last few days. Some preliminary work can be found here: https://github.com/jorisvandenbossche/target-encoder-benchmarks. Up to now, I have only done some preliminary work to set up the benchmarks (based on Patricio Cerda's code, https://arxiv.org/pdf/1806.00979.pdf), and with some initial datasets (medical charges and employee salaries) compared the different implementations with their default settings.
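For context, the basic quantity all of these implementations estimate is some regularized version of the per-category mean of the target. A minimal sketch of the unregularized idea, with optional smoothing towards the global mean (the function and variable names are illustrative only, not taken from any of the benchmarked libraries):

```python
import pandas as pd

def mean_target_encode(categories, y, smoothing=0.0):
    """Replace each category by the (optionally smoothed) mean of y.

    With smoothing > 0, rare categories are shrunk towards the global
    mean, the usual guard against overfitting on small groups.
    """
    df = pd.DataFrame({"cat": categories, "y": y})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return df["cat"].map(encoding).to_numpy()

values = mean_target_encode(["a", "a", "b", "b", "b"], [1.0, 0.0, 1.0, 1.0, 0.0])
# category "a" -> 0.5, category "b" -> 2/3
```

Real implementations differ mainly in how they regularize this estimate and in using cross-fitting / leave-one-out schemes so that a sample's own target does not leak into its encoding, which is exactly what the benchmarks compare.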
So there is still a lot to do (add datasets, investigate the actual differences between the implementations and their results, compare the options in a more structured way, etc.; there are some TODOs listed in the README). However, I am now mostly on holiday for the rest of December. If somebody wants to look at it further, that is certainly welcome; otherwise, it will be a priority for me at the beginning of January. For datasets: additional ideas are welcome. For now, the idea is to add a subset of the Criteo Terabyte Click dataset, and to generate some data. >>> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder >>> I would really like to add it before February >> A few months to get it right is not that bad, is it? > The PR is over a year old already, and you hadn't voiced any opposition > there. As far as I understand, the open PR is not a leave-one-out TargetEncoder? I also did not yet add the CountFeaturizer from that scikit-learn PR, because it is actually quite different (e.g. it doesn't work for regression tasks, as it counts conditional on y). But for classification it could easily be added to the benchmarks. Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Thu Dec 13 09:53:15 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 13 Dec 2018 15:53:15 +0100 Subject: [scikit-learn] Difference between linear model and tree-based regressor? In-Reply-To: References: Message-ID: They are very different statistical models from a mathematical point of view. See the online scikit-learn documentation or reference textbooks such as "Elements of Statistical Learning" for more details. In practice, linear models tend to be faster to fit on large data, especially when the number of features is large (although it depends on the solver, loss, penalty, data scaling...).
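The contrast is easy to see on a toy example (a minimal sketch with default settings; the quadratic target is just an illustration, not a benchmark):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2  # a target that is clearly not linear in X

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# The linear model underfits the quadratic relation (R^2 close to 0),
# while the piecewise-constant tree approximates it well (R^2 close to 1).
print(linear.score(X, y))
print(tree.score(X, y))

# Extrapolation: outside the training range the tree's prediction stays
# within the range of training targets, while the fitted line keeps going.
X_out = np.array([[10.0]])
print(tree.predict(X_out))
print(linear.predict(X_out))
```

The tree interpolates the nonlinear shape well inside the training range but predicts a constant outside it, which is the "cannot extrapolate" behaviour discussed below.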
Linear models cannot fit prediction tasks where the data is not linearly separable (by definition), while tree-based models do not have this restriction. Tree-based models can still underfit in some cases, but for different reasons (e.g. when we limit the depth of the trees). Linear models can be made more expressive via feature engineering (e.g. k-bins discretization, polynomial feature expansion, Nystroem kernel approximation...) and can thereby sometimes be competitive with tree-based models even on tasks that were originally not linearly separable. However, this is not guaranteed either. Cross-validation and parameter tuning are still required to tell which class of model works best for a specific task. As you said, tree-based models "cannot extrapolate" in the sense that their decision function is piecewise constant, while the decision function of a linear model is a hyperplane. Depending on the task, the lack of extrapolation can be considered either a limitation or a benefit (for instance to avoid unrealistic extrapolations such as people with a negative age or size, predicting negative mechanical energy loss via heat dissipation, fractions larger than 100%, or 6-stars-out-of-5 recommendations...). -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Thu Dec 13 09:58:28 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 13 Dec 2018 23:58:28 +0900 Subject: [scikit-learn] Difference between linear model and tree-based regressor? In-Reply-To: References: Message-ID: "Elements of Statistical Learning" is on my bookshelf, but even so, that was a great summary! J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcrudy at gmail.com Thu Dec 13 16:06:41 2018 From: jcrudy at gmail.com (Jason Rudy) Date: Thu, 13 Dec 2018 13:06:41 -0800 Subject: [scikit-learn] check_estimator and score_samples method In-Reply-To: References: Message-ID: Thanks, Joel.
From your response I assume that the use of a y argument to score_samples is not a violation of the sklearn API, so I'll keep the method and find a workaround for the check_estimator test as it's currently written. I'll comment on the issue as well. On Mon, Dec 10, 2018 at 2:58 PM Joel Nothman wrote: > We're trying to make check_estimator more flexible ( > https://github.com/scikit-learn/scikit-learn/pull/8022) but this is > certainly not something we had considered yet. Perhaps suggest it there? > > Or for now we could just make the check pass if score_samples yields a > TypeError with only X... > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Dec 14 10:46:10 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 14 Dec 2018 10:46:10 -0500 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> Message-ID: <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> On 12/13/18 4:16 AM, Joris Van den Bossche wrote: > Hi all, > > I finally had some time to start looking at it the last days. Some > preliminary work can be found here: > https://github.com/jorisvandenbossche/target-encoder-benchmarks. You continue to be my hero. 
Probably can not look at it in detail before the holidays though :-/ > > Up to now, I only did some preliminary work to set up the benchmarks > (based on Patricio Cerda's code, > https://arxiv.org/pdf/1806.00979.pdf), and with some initial datasets > (medical charges and employee salaries) compared the different > implementations with its default settings. > So there is still a lot to do (add datasets, investigate the actual > differences between the different implementations and results, in a > more structured way compare the options, etc, there are some todo's > listed in the README). However, now I am mostly on holidays for the > rest of December. If somebody wants to further look at it, that is > certainly welcome, otherwise, it will be a priority for me beginning > of January. > > For datasets: additional ideas are welcome. For now, the idea is to > add a subset of the Criteo Terabyte Click dataset, and to generate > some data. > > >>> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder > >>> I would really like to add it before February > >> A few month to get it right is not that bad, is it? > > The PR is over a year old already, and you hadn't voiced any opposition > > there. > > As far as I understand, the open PR is not a leave-one-out TargetEncoder? I would want it to be :-/ > I also did not yet add the CountFeaturizer from that scikit-learn PR, > because it is actually quite different (e.g it doesn't work for > regression tasks, as it counts conditional on y). But for > classification it could be easily added to the benchmarks. I'm confused now. That's what TargetEncoder and leave-one-out TargetEncoder do as well, right? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorisvandenbossche at gmail.com Sat Dec 15 07:35:54 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 15 Dec 2018 13:35:54 +0100 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> Message-ID: Op vr 14 dec. 2018 om 16:46 schreef Andreas Mueller : > As far as I understand, the open PR is not a leave-one-out TargetEncoder? > > I would want it to be :-/ > > I also did not yet add the CountFeaturizer from that scikit-learn PR, > because it is actually quite different (e.g it doesn't work for regression > tasks, as it counts conditional on y). But for classification it could be > easily added to the benchmarks. > > I'm confused now. That's what TargetEncoder and leave-one-out > TargetEncoder do as well, right?. > As far as I understand, that is not exactly what those do. The TargetEncoder (as implemented in dirty_cat, category_encoders and hccEncoders) will, for each category, calculate the expected value of the target depending on the category. 
For binary classification this indeed comes to counting the 0's and 1's, and there the information contained in the result might be similar as the sklearn PR, but the format is different: those packages calculate the probability (value between 0 and 1 as number of 1's divided by number of samples in that category) and return that as a single column, instead of returning two columns with the counts for the 0's and 1's. And for regression this is not related to counting anymore, but just the average of the target per category (in practice, the TargetEncoder is computing the same for regression or binary classification: the average of the target per category. But for regression, the CountFeaturizer doesn't work since there are no discrete values in the target to count). Furthermore, all of those implementations in the 3 mentioned packages have some kind of regularization (empirical bayes shrinkage, or KFold or leave-one-out cross-validation), while this is also not present in the CountFeaturizer PR (but this aspect is of course something we want to actually test in the benchmarks). Another thing I noticed in the CountFeaturizer implementation, is that the behaviour differs when y is passed or not. First, I find it a bit strange to do this as it is a quite different behaviour (counting the categories (to just encode the categorical variable with a notion about its frequency in the training set), or counting the target depending on the category is quite different?). But also, when using a transformer in a Pipeline, you don't control the passing of y, I think? So in that way, you always have the behaviour of counting the target. I would find it more logical to have those two things in two separate transformers (if we think the "frequency encoder" is useful enough). 
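To make that distinction concrete, here is a toy sketch in plain Python (invented data; it mirrors neither the scikit-learn PR's code nor the packages' implementations, and leaves out the regularization mentioned above):

```python
from collections import defaultdict

def target_encode(categories, y):
    """One column: the mean of the target per category. Works the same
    for regression and binary classification (no shrinkage/CV here)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, y):
        totals[cat] += target
        counts[cat] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    return [means[cat] for cat in categories]

def count_featurize(categories, y, classes=(0, 1)):
    """One column per class: counts of each class within the sample's
    category. Only defined when y is discrete, hence not for regression."""
    counts = {}
    for cat, target in zip(categories, y):
        counts.setdefault(cat, {c: 0 for c in classes})[target] += 1
    return [[counts[cat][c] for c in classes] for cat in categories]

cats = ["a", "a", "b", "b", "b"]
y = [1, 0, 1, 1, 0]
# category "a" has target mean 0.5 and class counts [1, 1];
# category "b" has target mean 2/3 and class counts [1, 2]
```

For binary targets the two encodings carry similar information (the mean is just the normalized count of 1's), which is the overlap discussed above; for a continuous y only the first function remains meaningful.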
(I need to give this feedback on the PR, but that will be for after the holidays) Joris > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kouichi.matsuda at gmail.com Sat Dec 15 09:02:06 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Sat, 15 Dec 2018 09:02:06 -0500 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: null Message-ID: Hi everyone, I am writing a scikit-learn program that uses MLPClassifier to learn Fashion-MNIST. The following is the program. It's very simple. When I ran it on a Windows 10 notebook (Core i7-8565U, 1.8GHz, 16GB), it took about 4 minutes. However, when I ran it on a MacBook (macOS), it took about 1 minute. Can anyone help me understand why Windows 10 is so slow? Am I missing something? Thanks,

import os
import gzip
import numpy as np

# from https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
def load_mnist(path, kind='train'):
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16)
        images = images.reshape(len(labels), 784)
    return images, labels

x_train, y_train = load_mnist('data', kind='train')
x_test, y_test = load_mnist('data', kind='t10k')

from sklearn.neural_network import MLPClassifier
import time
import datetime

print(datetime.datetime.today())
start = time.time()
mlp = MLPClassifier()
mlp.fit(x_train, y_train)
print((time.time() - start) / 60)

--- MATSUDA, Kouichi, Ph.D.
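One variable worth ruling out before blaming the hardware is the stopping criterion: MLPClassifier stops fitting once the training loss has failed to improve on the best loss by more than tol for some number of consecutive epochs (later in this thread it emerges that this patience was hardcoded to 2 before scikit-learn 0.20 and defaults to 10 from 0.20 on). A pure-Python sketch of that rule on made-up loss values (illustrative only, not scikit-learn's actual code):

```python
def stopping_epoch(losses, tol=1e-4, n_iter_no_change=10):
    """Return the 1-based epoch at which training would stop, or None.

    Rule sketched here: stop once the loss has failed to improve on the
    best loss seen so far by more than `tol` for `n_iter_no_change`
    consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss > best - tol:
            stale += 1          # not enough improvement this epoch
        else:
            stale = 0           # real improvement resets the counter
        best = min(best, loss)
        if stale >= n_iter_no_change:
            return epoch
    return None

plateau = [1.0, 0.5, 0.4999, 0.4998, 0.4997, 0.4996]
# with tol=1e-3 and a patience of 2 (the old hardcoded value), this
# plateau stops at epoch 4; with a patience of 10 it runs to the end
```

Two installs applying different patience values to the same loss curve will report very different wall-clock times for fit, independently of any BLAS differences.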
-------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Dec 15 10:47:33 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 15 Dec 2018 16:47:33 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: <20181215154733.ljxvqdx7jfuhn3nx@phare.normalesup.org> I suspect that it is probably due to the linear-algebra libraries: your scientific Python install on macOS is probably using optimized linear algebra (i.e. optimized numpy and scipy), but your install on Windows is not. I would recommend looking at how you installed your Python distribution on macOS and on Windows, as you likely have installed an optimized one on one of the platforms and not on the other. Cheers, Gaël On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: > Hi?Hi everyone, > I am writing a scikit-learn program to use MLPClassifier to learn > Fashion-MNIST. > The following is the program. It's very simple. > When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it took > about 4 minutes. > However, when I ran it on MacBook(macOS), it took about 1 minutes. > Does anyone help me to understand the reason why Windows 10 is so slow? > Am I missing something? > Thanks,??
> import os import gzip import numpy as np #from https://github.com/ > zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py def load_mnist > (path, kind='train'): labels_path = os.path.join(path,'%s-labels-idx1-ubyte.gz' > % kind) images_path = os.path.join(path,'%s-images-idx3-ubyte.gz' % kind) with > gzip.open(labels_path, 'rb') as lbpath: labels = np.frombuffer(lbpath.read(), > dtype=np.uint8, offset=8) with gzip.open(images_path, 'rb') as imgpath: images > = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16) images = > images.reshape(len(labels), 784) return images, labels x_train, y_train = > load_mnist('data', kind='train') x_test, y_test = load_mnist('data', kind= > 't10k') from sklearn.neural_network import MLPClassifier import time import > datetime print(datetime.datetime.today()) start = time.time() mlp = > MLPClassifier() mlp.fit(x_train, y_train) print((time.time() - start)/ 60) > --- > MATSUDA, Kouichi, Ph.D. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From minminmail at hotmail.com Sun Dec 16 17:09:22 2018 From: minminmail at hotmail.com (rui min) Date: Sun, 16 Dec 2018 22:09:22 +0000 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn Message-ID: Dear scikit-learn developers, I am Rui from Spain, Granada University. Currently I am planning to write an association rule algorithm in scikit-learn. I don?t know if anyone is working on that. So avoid duplication of the work, I would like to ask here. Hope to hear from you soon. Best Regards Rui -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From joel.nothman at gmail.com Mon Dec 17 01:26:26 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 17 Dec 2018 17:26:26 +1100 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: References: Message-ID: Hi Rui, This has been discussed several times on the mailing list and issue tracker. We are not interested in association rule mining in Scikit-learn for its own purposes. We would be interested in association rule mining only as part of a classification algorithm. Are there such algorithms which are mature and popular enough to meet our inclusion criteria (see our FAQ)? Cheers, Joel On Mon, 17 Dec 2018 at 09:24, rui min wrote: > Dear scikit-learn developers, > > > I am Rui from Spain, Granada University. Currently I am planning to > write an association rule algorithm in scikit-learn. > > I don?t know if anyone is working on that. So avoid duplication of the > work, I would like to ask here. > > > Hope to hear from you soon. > > > > Best Regards > > > > Rui > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Mon Dec 17 01:46:56 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 17 Dec 2018 00:46:56 -0600 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: References: Message-ID: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> Hi Rui, I agree with Joel that association rule mining could be a bit tricky to fit nicely within the scikit-learn API. Maybe this could be some transformer class? I thought about that a few years ago but remember that I couldn't come up with a good solution at that point. 
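For readers new to the topic: frequent-itemset mining means finding all item combinations whose support (the fraction of transactions containing them) reaches a threshold; apriori, Eclat and FP-Growth are increasingly clever ways of computing that same answer. A brute-force pure-Python sketch on made-up transactions (illustrative only; exponential in the number of items):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.6):
    """Brute-force frequent-itemset search; fine for toy data only."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        any_frequent = False
        for combo in combinations(items, size):
            support = sum(set(combo) <= t for t in transactions) / n
            if support >= min_support:
                result[combo] = support
                any_frequent = True
        if not any_frequent:
            # apriori property: every subset of a frequent itemset is
            # frequent, so no larger itemset can be frequent either
            break
    return result

transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}]
# frequent_itemsets(transactions) keeps ('milk',), ('bread',) and
# ('bread', 'milk'); 'eggs' appears in only 1 of 3 transactions
```

Association rules are then derived from these itemsets by comparing the support of a rule's antecedent and consequent (confidence, lift, ...), which is the second step the mlxtend API discussed below separates out.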
In any case, I have an association rule implementation in mlxtend (http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/), which is based on the apriori algorithm. Some users were asking about the Eclat and FP-Growth algorithms instead of apriori. I would be very happy about a contribution implementing Eclat or FP-Growth (see the issue tracker at https://github.com/rasbt/mlxtend/issues/248), such that instead of

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

one could use

frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True)

or

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

If you had an alternative algorithm for frequent itemset generation in mind (I am not sure whether others exist, to be honest), I would be happy about that one, too. Best, Sebastian > On Dec 17, 2018, at 12:26 AM, Joel Nothman wrote: > > Hi Rui, > > This has been discussed several times on the mailing list and issue tracker. We are not interested in association rule mining in Scikit-learn for its own purposes. We would be interested in association rule mining only as part of a classification algorithm. Are there such algorithms which are mature and popular enough to meet our inclusion criteria (see our FAQ)? > > Cheers, > > Joel > > On Mon, 17 Dec 2018 at 09:24, rui min wrote: > Dear scikit-learn developers, > > I am Rui from Spain, Granada University. Currently I am planning to write an association rule algorithm in scikit-learn. > I don?t know if anyone is working on that. So avoid duplication of the work, I would like to ask here. > > Hope to hear from you soon.
> > > Best Regards > > > Rui > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From kouichi.matsuda at gmail.com Mon Dec 17 09:54:07 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Mon, 17 Dec 2018 06:54:07 -0800 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: <20181215154733.ljxvqdx7jfuhn3nx@phare.normalesup.org> References: <20181215154733.ljxvqdx7jfuhn3nx@phare.normalesup.org> Message-ID: Thank you for your quick reply. It's very helpful. It turns out to be caused by Anaconda: its Python stops the iterations much earlier, as shown below (with verbose=True). I am not sure why 'n_iter_no_change=10' would be changed in Anaconda. Anaconda might modify the MLPClassifier implementation.

> python learn.py (in pure Python+Scikit-Learn)
...
Iteration 125, loss = 0.26152263
Iteration 126, loss = 0.25705940
Iteration 127, loss = 0.25957841
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
0.8496

> python learn.py (in Anaconda)
...
Iteration 23, loss = 0.34410594
Iteration 24, loss = 0.34663903
Iteration 25, loss = 0.34376815
Training loss did not improve more than tol=0.000100 for two consecutive epochs. Stopping.
0.852

Thanks, --- ???? MATSUDA, Kouichi, Ph.D. 2018?12?16?(?) 0:50 Gael Varoquaux : > I suspect that it is probably due to the linear-algebra libraries: your > scientific Python install on macOS is probably using optimized > linear-algebra (ie optimized numpy and scipy), but not your install on > Windows. > > I would recommend you to look at how you installed you Python > distribution on macOS and on Windows, as you likely have installed an > optimized one on one of the platforms and not on the other.
> > Cheers, > > Ga?l > > On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: > > Hi Hi everyone, > > > I am writing a scikit-learn program to use MLPClassifier to learn > > Fashion-MNIST. > > The following is the program. It's very simple. > > When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it > took > > about 4 minutes. > > However, when I ran it on MacBook(macOS), it took about 1 minutes. > > Does anyone help me to understand the reason why Windows 10 is so slow? > > Am I missing something? > > > Thanks, > > > import os import gzip import numpy as np #from https://github.com/ > > zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py def > load_mnist > > (path, kind='train'): labels_path = > os.path.join(path,'%s-labels-idx1-ubyte.gz' > > % kind) images_path = os.path.join(path,'%s-images-idx3-ubyte.gz' % > kind) with > > gzip.open(labels_path, 'rb') as lbpath: labels = > np.frombuffer(lbpath.read(), > > dtype=np.uint8, offset=8) with gzip.open(images_path, 'rb') as imgpath: > images > > = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16) images = > > images.reshape(len(labels), 784) return images, labels x_train, y_train = > > load_mnist('data', kind='train') x_test, y_test = load_mnist('data', > kind= > > 't10k') from sklearn.neural_network import MLPClassifier import time > import > > datetime print(datetime.datetime.today()) start = time.time() mlp = > > MLPClassifier() mlp.fit(x_train, y_train) print((time.time() - start)/ > 60) > > > > --- > > MATSUDA, Kouichi, Ph.D. 
> > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Dec 17 10:01:54 2018 From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=) Date: Mon, 17 Dec 2018 16:01:54 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Dec 17 10:14:52 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Mon, 17 Dec 2018 16:14:52 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: I checked on 0.20.1 using scikit-learn shipped by Anaconda and both seem to have the same default. On Mon, 17 Dec 2018 at 16:01, Guillaume Lema?tre wrote: > could you provide the scikit-learn version in both case? > > Sent from my phone - sorry to be brief and potential misspell. > *From:* kouichi.matsuda at gmail.com > *Sent:* 17 December 2018 15:56 > *To:* scikit-learn at python.org > *Reply to:* scikit-learn at python.org > *Subject:* Re: [scikit-learn] MLPClassifier on WIndows 10 is 4 times > slower than that on macOS? > > Thank you for your quick reply. It's very helpful. > It's because of Anaconda: Its python stops the iteration soon as follows > (w/ verbose=True). > I am not sure why 'n_iter_no_change=10' is changed in Anaconda. 
> Anaconda might modify the MLPClassifier implementation. > > > python learn.py (in pure Python+Scikit-Learn) > ... > > Iteration 125, loss = 0.26152263 > > Iteration 126, loss = 0.25705940 > > Iteration 127, loss = 0.25957841 > > Training loss did not improve more than tol=0.000100 for 10 consecutive > epochs. Stopping. > 0.8496 > > > python learn.py (in Anaconda) > ... > Iteration 23, loss = 0.34410594 > Iteration 24, loss = 0.34663903 > Iteration 25, loss = 0.34376815 > Training loss did not improve more than tol=0.000100 for two consecutive > epochs. Stopping. > 0.852 > > Thanks, > > > --- > ???? MATSUDA, Kouichi, Ph.D. > > > 2018?12?16?(?) 0:50 Gael Varoquaux : > >> I suspect that it is probably due to the linear-algebra libraries: your >> scientific Python install on macOS is probably using optimized >> linear-algebra (ie optimized numpy and scipy), but not your install on >> Windows. >> >> I would recommend you to look at how you installed you Python >> distribution on macOS and on Windows, as you likely have installed an >> optimized one on one of the platforms and not on the other. >> >> Cheers, >> >> Ga?l >> >> On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: >> > Hi Hi everyone, >> >> > I am writing a scikit-learn program to use MLPClassifier to learn >> > Fashion-MNIST. >> > The following is the program. It's very simple. >> > When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it >> took >> > about 4 minutes. >> > However, when I ran it on MacBook(macOS), it took about 1 minutes. >> > Does anyone help me to understand the reason why Windows 10 is so slow? >> > Am I missing something? 
>> >> > Thanks, >> >> > import os import gzip import numpy as np #from https://github.com/ >> > zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py def >> load_mnist >> > (path, kind='train'): labels_path = os.path.join(path,'% >> s-labels-idx1-ubyte.gz' >> > % kind) images_path = os.path.join(path,'%s-images-idx3-ubyte.gz' % >> kind) with >> > gzip.open(labels_path, 'rb') as lbpath: labels = np.frombuffer( >> lbpath.read(), >> > dtype=np.uint8, offset=8) with gzip.open(images_path, 'rb') as >> imgpath: images >> > = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16) images = >> > images.reshape(len(labels), 784) return images, labels x_train, >> y_train = >> > load_mnist('data', kind='train') x_test, y_test = load_mnist('data', >> kind= >> > 't10k') from sklearn.neural_network import MLPClassifier import time >> import >> > datetime print(datetime.datetime.today()) start = time.time() mlp = >> > MLPClassifier() mlp.fit(x_train, y_train) print((time.time() - start)/ >> 60) >> >> >> > --- >> > MATSUDA, Kouichi, Ph.D. >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Gael Varoquaux >> Senior Researcher, INRIA Parietal >> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> Phone: ++ 33-1-69-08-79-68 <+33169087968> >> http://gael-varoquaux.info >> http://twitter.com/GaelVaroquaux >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Mon Dec 17 22:07:31 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 17 Dec 2018 22:07:31 -0500 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> References: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> Message-ID: Can we add this to the FAQ as out of scope? Sebastian: feel free to put more into mlxtend :P On 12/17/18 1:46 AM, Sebastian Raschka wrote: > Hi Rui, > > I agree with Joel that association rule mining could be a bit tricky to fit nicely within the scikit-learn API. Maybe this could be some transformer class? I thought about that a few years ago but remember that I couldn't come up with a good solution at that point. > > In any case, I have an association rule implementation in mlxtend (http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/), which is based on the apriori algorithm. Some users were asking about Eclat and FP-Growth algorithms, instead of apriori. If you are interested in such a contribution, i.e., implementing Eclat or FP-Growth such that instead of > > frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7) > > one could use > > frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True) > > or > > frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7) > > I would be very happy about such a contribution (see issue tracker at https://github.com/rasbt/mlxtend/issues/248) > > If you had an alternative algorithm for frequent itemset generation in mind (I am not sure if others exist, to be honest). I would also be happy about that one, too. 
> > Best, > Sebastian > >> On Dec 17, 2018, at 12:26 AM, Joel Nothman wrote: >> >> Hi Rui, >> >> This has been discussed several times on the mailing list and issue tracker. We are not interested in association rule mining in Scikit-learn for its own purposes. We would be interested in association rule mining only as part of a classification algorithm. Are there such algorithms which are mature and popular enough to meet our inclusion criteria (see our FAQ)? >> >> Cheers, >> >> Joel >> >> On Mon, 17 Dec 2018 at 09:24, rui min wrote: >> Dear scikit-learn developers, >> >> I am Rui from Spain, Granada University. Currently I am planning to write an association rule algorithm in scikit-learn. >> I don?t know if anyone is working on that. So avoid duplication of the work, I would like to ask here. >> >> Hope to hear from you soon. >> >> >> Best Regards >> >> >> Rui >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From dmitrii.ignatov at gmail.com Tue Dec 18 03:17:00 2018 From: dmitrii.ignatov at gmail.com (Dmitry Ignatov) Date: Tue, 18 Dec 2018 11:17:00 +0300 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> References: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> Message-ID: Hi All, Just a short comment to "If you had an alternative algorithm for frequent itemset generation in mind (I am not sure if others exist, to be honest). I would also be happy about that one, too." 
There are many other techniques and their modifications for related problems like sequence mining, see e.g. here: http://www.philippe-fournier-viger.com/spmf/. In my opinion, a notable difference for practice exists between frequent itemsets and closed (frequent) itemsets; the latter may reduce an output drastically. However, combinatorial explosion w.r.t. the number of produced patterns is an issue here. Best, Dmitry ??, 17 ???. 2018 ?. ? 10:12, Sebastian Raschka : > Hi Rui, > > I agree with Joel that association rule mining could be a bit tricky to > fit nicely within the scikit-learn API. Maybe this could be some > transformer class? I thought about that a few years ago but remember that I > couldn't come up with a good solution at that point. > > In any case, I have an association rule implementation in mlxtend ( > http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/), > which is based on the apriori algorithm. Some users were asking about Eclat > and FP-Growth algorithms, instead of apriori. If you are interested in such > a contribution, i.e., implementing Eclat or FP-Growth such that instead of > > frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", > min_threshold=0.7) > > one could use > > frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True) > > or > > frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", > min_threshold=0.7) > > I would be very happy about such a contribution (see issue tracker at > https://github.com/rasbt/mlxtend/issues/248) > > If you had an alternative algorithm for frequent itemset generation in > mind (I am not sure if others exist, to be honest). I would also be happy > about that one, too. 
> > Best, > Sebastian > > > On Dec 17, 2018, at 12:26 AM, Joel Nothman > wrote: > > > > Hi Rui, > > > > This has been discussed several times on the mailing list and issue > tracker. We are not interested in association rule mining in Scikit-learn > for its own purposes. We would be interested in association rule mining > only as part of a classification algorithm. Are there such algorithms which > are mature and popular enough to meet our inclusion criteria (see our FAQ)? > > > > Cheers, > > > > Joel > > > > On Mon, 17 Dec 2018 at 09:24, rui min wrote: > > Dear scikit-learn developers, > > > > I am Rui from Spain, Granada University. Currently I am planning to > write an association rule algorithm in scikit-learn. > > I don't know if anyone is working on that. So to avoid duplication of > the work, I would like to ask here. > > > > Hope to hear from you soon. > > > > > > Best Regards > > > > > > Rui > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kouichi.matsuda at gmail.com Tue Dec 18 07:16:00 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Tue, 18 Dec 2018 21:16:00 +0900 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: Great! Thanks. Whoops, the latest Anaconda does not support the latest scikit-learn... >>> print(sklearn.__version__) 0.19.2 I should have checked the change log ...
orz >> n_iter_no_change parameter now at 10 from previously hardcoded 2. #9456 by Nicholas Nadeau. It might be confusing to change it to be stricter. Thanks and sorry for bothering you. --- MATSUDA, Kouichi, Ph.D. On Tue, 18 Dec 2018 at 0:17, Guillaume Lemaître wrote: > I checked on 0.20.1 using scikit-learn shipped by Anaconda and both seem > to have the same default. > > On Mon, 17 Dec 2018 at 16:01, Guillaume Lemaître > wrote: > >> could you provide the scikit-learn version in both cases? >> >> Sent from my phone - sorry to be brief and potential misspell. >> *From:* kouichi.matsuda at gmail.com >> *Sent:* 17 December 2018 15:56 >> *To:* scikit-learn at python.org >> *Reply to:* scikit-learn at python.org >> *Subject:* Re: [scikit-learn] MLPClassifier on WIndows 10 is 4 times >> slower than that on macOS? >> >> Thank you for your quick reply. It's very helpful. >> It's because of Anaconda: its Python stops the iterations early, as follows >> (w/ verbose=True). >> I am not sure why 'n_iter_no_change=10' was changed in Anaconda. >> Anaconda might modify the MLPClassifier implementation. >> >> > python learn.py (in pure Python+Scikit-Learn) >> ... >> >> Iteration 125, loss = 0.26152263 >> >> Iteration 126, loss = 0.25705940 >> >> Iteration 127, loss = 0.25957841 >> >> Training loss did not improve more than tol=0.000100 for 10 consecutive >> epochs. Stopping. >> 0.8496 >> >> > python learn.py (in Anaconda) >> ... >> Iteration 23, loss = 0.34410594 >> Iteration 24, loss = 0.34663903 >> Iteration 25, loss = 0.34376815 >> Training loss did not improve more than tol=0.000100 for two consecutive >> epochs. Stopping. >> 0.852 >> >> Thanks, >> >> >> --- >> MATSUDA, Kouichi, Ph.D. >> >> >> On Sun, 16 Dec 2018 at 0:50, Gael Varoquaux wrote: >> >>> I suspect that it is probably due to the linear-algebra libraries: your >>> scientific Python install on macOS is probably using optimized >>> linear-algebra (ie optimized numpy and scipy), but not your install on >>> Windows.
>>> >>> I would recommend you to look at how you installed your Python >>> distribution on macOS and on Windows, as you likely have installed an >>> optimized one on one of the platforms and not on the other. >>> >>> Cheers, >>> >>> Gaël >>> >>> On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: >>> > Hi everyone, >>> >>> > I am writing a scikit-learn program to use MLPClassifier to learn >>> > Fashion-MNIST. >>> > The following is the program. It's very simple. >>> > When I ran it on a Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) notebook, >>> it took >>> > about 4 minutes. >>> > However, when I ran it on a MacBook (macOS), it took about 1 minute. >>> > Can anyone help me understand the reason why Windows 10 is so slow? >>> > Am I missing something? >>> >>> > Thanks, >>>
>>> > import os
>>> > import gzip
>>> > import numpy as np
>>> >
>>> > # from https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
>>> > def load_mnist(path, kind='train'):
>>> >     labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
>>> >     images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
>>> >     with gzip.open(labels_path, 'rb') as lbpath:
>>> >         labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
>>> >     with gzip.open(images_path, 'rb') as imgpath:
>>> >         images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16)
>>> >         images = images.reshape(len(labels), 784)
>>> >     return images, labels
>>> >
>>> > x_train, y_train = load_mnist('data', kind='train')
>>> > x_test, y_test = load_mnist('data', kind='t10k')
>>> >
>>> > from sklearn.neural_network import MLPClassifier
>>> > import time
>>> > import datetime
>>> >
>>> > print(datetime.datetime.today())
>>> > start = time.time()
>>> > mlp = MLPClassifier()
>>> > mlp.fit(x_train, y_train)
>>> > print((time.time() - start) / 60)
>>> >>> >>> > --- >>> > MATSUDA, Kouichi, Ph.D.
>>> >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> -- >>> Gael Varoquaux >>> Senior Researcher, INRIA Parietal >>> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >>> Phone: ++ 33-1-69-08-79-68 <+33169087968> >>> http://gael-varoquaux.info >>> http://twitter.com/GaelVaroquaux >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Dec 18 12:05:21 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 18 Dec 2018 18:05:21 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: You should probably just "conda update scikit-learn": scikit-learn 0.20.1 is available on the official anaconda channel for all supported operating systems: https://anaconda.org/anaconda/scikit-learn -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From kouichi.matsuda at gmail.com Tue Dec 18 18:58:32 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Wed, 19 Dec 2018 08:58:32 +0900 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: Great! Thanks! On Wed, 19 Dec 2018 at 2:07 PM, Olivier Grisel <olivier.grisel at ensta.org> wrote: > You should probably just "conda update scikit-learn": > > scikit-learn 0.20.1 is available on the official anaconda channel for all > supported operating systems: > https://anaconda.org/anaconda/scikit-learn > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Dec 19 17:27:21 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 19 Dec 2018 17:27:21 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> Message-ID: <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> Can we please nail down dates for a sprint? On 11/20/18 2:25 PM, Gael Varoquaux wrote: > On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: >> We can also do Paris in April / May or June if that's ok with Joel and better >> for Andreas. > Absolutely. > > My thoughts here are that I want to minimize transportation, partly > because flying has a large carbon footprint. Also, for personal reasons, > I am not sure that I will be able to make it to Austin in July, but I > realize that this is a pretty bad argument. > > We're happy to try to host in Paris whenever it's most convenient and to > try to help with travel for those not in Paris.
> > Gaël > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Wed Dec 19 17:31:23 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 19 Dec 2018 17:31:23 -0500 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> Message-ID: <8fa34e9c-7205-4803-16c1-7c52b09178ee@gmail.com> On 12/15/18 7:35 AM, Joris Van den Bossche wrote: > On Fri, 14 Dec 2018 at 16:46, Andreas Mueller wrote: > >> As far as I understand, the open PR is not a leave-one-out >> TargetEncoder? > I would want it to be :-/ >> I also did not yet add the CountFeaturizer from that scikit-learn >> PR, because it is actually quite different (e.g. it doesn't work >> for regression tasks, as it counts conditional on y). But for >> classification it could be easily added to the benchmarks. > I'm confused now. That's what TargetEncoder and leave-one-out > TargetEncoder do as well, right? > > > As far as I understand, that is not exactly what those do. The > TargetEncoder (as implemented in dirty_cat, category_encoders and > hccEncoders) will, for each category, calculate the expected value of > the target depending on the category.
For binary classification this > indeed comes to counting the 0's and 1's, and there the information > contained in the result might be similar to the sklearn PR, but the > format is different: those packages calculate the probability (value > between 0 and 1 as number of 1's divided by number of samples in that > category) and return that as a single column, instead of returning two > columns with the counts for the 0's and 1's. This is a standard case of the "binary special case", right? For multi-class you need multiple columns, right? Doing a single column for binary makes sense, I think. > And for regression this is not related to counting anymore, but just > the average of the target per category (in practice, the TargetEncoder > is computing the same for regression or binary classification: the > average of the target per category. But for regression, the > CountFeaturizer doesn't work since there are no discrete values in the > target to count). I guess CountFeaturizer was not implemented with regression in mind. Actually being able to do regression and classification in the same estimator shows that "CountFeaturizer" is probably the wrong name. > > Furthermore, all of those implementations in the 3 mentioned packages > have some kind of regularization (empirical bayes shrinkage, or KFold > or leave-one-out cross-validation), while this is also not present in > the CountFeaturizer PR (but this aspect is of course something we want > to actually test in the benchmarks). > > Another thing I noticed in the CountFeaturizer implementation, is that > the behaviour differs when y is passed or not. First, I find it a bit > strange to do this as it is a quite different behaviour (counting the > categories (to just encode the categorical variable with a notion > about its frequency in the training set), or counting the target > depending on the category is quite different?). But also, when using a
But also, when using a > transformer in a Pipeline, you don't control the passing of y, I > think? So in that way, you always have the behaviour of counting the > target. > I would find it more logical to have those two things in two separate > transformers (if we think the "frequency encoder" is useful enough). > (I need to give this feedback on the PR, but that will be for after > the holidays) > I'm pretty sure I mentioned that before, I think optional y is bad. I just thought it was weird but the pipeline argument is a good one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Dec 19 17:33:02 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 19 Dec 2018 23:33:02 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> Message-ID: <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> I would propose the week of Feb 25th, as I heard people say that they might be available at this time. It is good for many people, or should we organize a doodle? G On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > Can we please nail down dates for a sprint? > On 11/20/18 2:25 PM, Gael Varoquaux wrote: > > On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > > > We can also do Paris in April / May or June if that's ok with Joel and better > > > for Andreas. > > Absolutely. > > My thoughts here are that I want to minimize transportation, partly > > because flying has a large carbon footprint. 
Also, for personal reasons, > > I am not sure that I will be able to make it to Austin in July, but I > > realize that this is a pretty bad argument. > > We're happy to try to host in Paris whenever it's most convenient and to > > try to help with travel for those not in Paris. > > Gaël > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From pahome.chen at mirlab.org Thu Dec 20 02:09:34 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 20 Dec 2018 15:09:34 +0800 Subject: [scikit-learn] time complexity of tree-based model? Message-ID: I'm doing some benchmarks in my experiments, and I almost always use ensemble-based regressors. What is the time complexity if I use a random forest regressor? Assume I only set *n_estimators=100* and leave the other parameters at their defaults. thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Thu Dec 20 02:19:48 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 20 Dec 2018 01:19:48 -0600 Subject: [scikit-learn] time complexity of tree-based model? In-Reply-To: References: Message-ID: <9CEAACA6-67A4-4382-AA5B-BDD6788D9905@sebastianraschka.com> Say n is the number of examples and m is the number of features, then a naive implementation of a balanced binary decision tree is O(m * n^2 log n). I think scikit-learn's decision tree caches the sorted features, so this reduces to O(m * n log n).
Then, multiply that O(m * n log n) by the number of decision trees in the forest. Best, Sebastian > On Dec 20, 2018, at 1:09 AM, lampahome wrote: > > I'm doing some benchmarks in my experiments, and I almost always use ensemble-based regressors. > > What is the time complexity if I use a random forest regressor? Assume I only set n_estimators=100 and leave the other parameters at their defaults. > > thx > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From aneto at chatdesk.com Thu Dec 20 10:20:15 2018 From: aneto at chatdesk.com (Aneto) Date: Thu, 20 Dec 2018 09:20:15 -0600 Subject: [scikit-learn] How to keep a model running in memory? Message-ID: Hi scikit-learn community, We currently use scikit-learn for a model that generates predictions on a server endpoint. We would like to keep the model running in memory instead of having to re-load the model for every new request that comes in to the server. Can you please point us in the right direction for this? Any tutorials or examples would help. In case it's helpful, we use Flask for our web server. Thank you! Aneto -------------- next part -------------- An HTML attachment was scrubbed... URL: From rhochmuth at alteryx.com Thu Dec 20 11:12:18 2018 From: rhochmuth at alteryx.com (Roland Hochmuth) Date: Thu, 20 Dec 2018 16:12:18 +0000 Subject: [scikit-learn] How to keep a model running in memory? In-Reply-To: References: Message-ID: Hi Liam, Not sure I have the complete context for what you are trying to do, but have you considered using Python multiprocessing to start a separate process? The lifecycle of that process could start when the Flask server starts up or on the first request. The separate process would load and run the model. Depending on what you would like to do, some form of IPC mechanism, such as gRPC, could be used to control or get updates from the model process.
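The long-lived-worker pattern described just above can be sketched with only the Python standard library. This is an illustrative sketch rather than Roland's actual setup: the Pipe stands in for the gRPC channel he mentions, and the doubling function is a hypothetical stand-in for a fitted model's predict method.

```python
from multiprocessing import Pipe, Process

def model_worker(conn):
    # Load the model once when the process starts; a trivial function
    # plays the role of model.predict in this sketch.
    predict = lambda x: x * 2
    while True:
        request = conn.recv()        # block until the server sends work
        if request is None:          # sentinel value: shut down cleanly
            break
        conn.send(predict(request))  # reply with the "prediction"

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    worker = Process(target=model_worker, args=(child_conn,))
    worker.start()                   # the model now lives in the worker's memory
    parent_conn.send(21)             # a request handler forwards the input
    print(parent_conn.recv())        # prints 42
    parent_conn.send(None)           # ask the worker to exit
    worker.join()
```

The web server keeps `parent_conn` around for the lifetime of the process, so requests never pay the model-loading cost again; swapping the Pipe for gRPC changes the transport but not the structure.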
Regards --Roland From: scikit-learn on behalf of Aneto Reply-To: Scikit-learn mailing list Date: Thursday, December 20, 2018 at 8:21 AM To: "scikit-learn at python.org" Cc: Liam Geron Subject: [scikit-learn] How to keep a model running in memory? Hi scikit learn community, We currently use scikit-learn for a model that generates predictions on a server endpoint. We would like to keep the model running in memory instead of having to re-load the model for every new request that comes in to the server. Can you please point us in the right direction for this? Any tutorials or examples. In case it's helpful, we use Flask for our web server. Thank you! Aneto -------------- next part -------------- An HTML attachment was scrubbed... URL: From leefrance79 at gmail.com Thu Dec 20 12:14:35 2018 From: leefrance79 at gmail.com (=?utf-8?B?7J207J246rec?=) Date: Thu, 20 Dec 2018 12:14:35 -0500 Subject: [scikit-learn] Submission Message-ID: <6C947338-7776-44C5-908D-862C3F76135A@gmail.com> Leon LEE Leefrance79 at gmail.com Skype: leefrance7979 From t3kcit at gmail.com Thu Dec 20 12:44:24 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 20 Dec 2018 12:44:24 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: Works for me! On 12/19/18 5:33 PM, Gael Varoquaux wrote: > I would propose the week of Feb 25th, as I heard people say that they > might be available at this time. It is good for many people, or should we > organize a doodle? 
> > G > > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> Can we please nail down dates for a sprint? > >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >>>> We can also do Paris in April / May or June if that's ok with Joel and better > >>>> for Andreas. > >>> Absolutely. > >>> My thoughts here are that I want to minimize transportation, partly > >>> because flying has a large carbon footprint. Also, for personal reasons, > >>> I am not sure that I will be able to make it to Austin in July, but I > >>> realize that this is a pretty bad argument. > >>> We're happy to try to host in Paris whenever it's most convenient and to > >>> try to help with travel for those not in Paris. > >>> Gaël > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From rhochmuth at alteryx.com Thu Dec 20 13:07:31 2018 From: rhochmuth at alteryx.com (Roland Hochmuth) Date: Thu, 20 Dec 2018 18:07:31 +0000 Subject: [scikit-learn] How to keep a model running in memory? In-Reply-To: References: Message-ID: <5464DD75-2F91-4A7F-9F8D-959CE33C8480@alteryx.com> Hi Liam, I would suggest starting out by taking a look at the gRPC quickstart for Python at https://grpc.io/docs/quickstart/python.html and then modifying that example to do what you would like. The Flask server would launch the separate process using multiprocessing. The model process would create a gRPC service endpoint. The Flask server would wait for the model process to start and then establish a gRPC connection as a client to the gRPC service endpoint of the model process. The gRPC service of the model process would have methods, such as trainModel or getModelStatus, etc.
When an http request occurs on the Flask http server, the server would then invoke the gRPC methods in the model process. I hope that helps. Regards --Roland From: Liam Geron Date: Thursday, December 20, 2018 at 9:53 AM To: Roland Hochmuth Cc: Scikit-learn mailing list Subject: Re: [scikit-learn] How to keep a model running in memory? Hi Roland, Thanks for the suggestion! I'll certainly look into gRPC or similar frameworks. Currently we have multiprocessing, but it's not used to that same extent. How would the second process have a sort of "listener" to respond to incoming requests if it is running persistently? Thanks so much for the help. Best, Liam On Thu, Dec 20, 2018 at 11:12 AM Roland Hochmuth > wrote: Hi Liam, Not sure I have the complete context for what you are trying to do, but have you considered using Python multiprocessing to start a separate process? The lifecycle of that process could start when the Flask server starts-up or on the first request. The separate process would load and run the model. Depending on what you would like to do, some form of IPC mechanism, such as gRPC could be used to control or get updates from the model process. Regards --Roland From: scikit-learn > on behalf of Aneto > Reply-To: Scikit-learn mailing list > Date: Thursday, December 20, 2018 at 8:21 AM To: "scikit-learn at python.org" > Cc: Liam Geron > Subject: [scikit-learn] How to keep a model running in memory? Hi scikit learn community, We currently use scikit-learn for a model that generates predictions on a server endpoint. We would like to keep the model running in memory instead of having to re-load the model for every new request that comes in to the server. Can you please point us in the right direction for this? Any tutorials or examples. In case it's helpful, we use Flask for our web server. Thank you! Aneto -------------- next part -------------- An HTML attachment was scrubbed... 
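Independent of gRPC, the core of the original question in this thread is to avoid reloading the estimator on every request. A framework-agnostic sketch of that load-once pattern follows; `get_model`, `handle_request`, and the temporary model path are illustrative names, not scikit-learn API, and in a Flask app the same caching would typically live at module import time so every request handler shares one in-memory estimator.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# One-time setup: train and persist a small stand-in model to disk.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
MODEL_PATH = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(LogisticRegression(max_iter=1000).fit(X, y), MODEL_PATH)

_model = None  # module-level cache, filled on first use

def get_model():
    """Load the model from disk only once; later calls reuse the same object."""
    global _model
    if _model is None:
        _model = load(MODEL_PATH)
    return _model

def handle_request(features):
    # A request handler calls this; no disk I/O after the first request.
    return int(get_model().predict([features])[0])
```

With this shape, restarting the server is the only time the model is read from disk; the separate-process and gRPC variants discussed above add isolation on top of the same idea.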
URL: From adrin.jalali at gmail.com Thu Dec 20 14:32:46 2018 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 20 Dec 2018 20:32:46 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: It'll be the least favourable week of February for me, but I can make do. On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: > Works for me! > > On 12/19/18 5:33 PM, Gael Varoquaux wrote: > > I would propose the week of Feb 25th, as I heard people say that they > > might be available at this time. It is good for many people, or should we > > organize a doodle? > > > > G > > > > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> Can we please nail down dates for a sprint? > >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >>>> We can also do Paris in April / May or June if that's ok with Joel > and better > >>>> for Andreas. > >>> Absolutely. > >>> My thoughts here are that I want to minimize transportation, partly > >>> because flying has a large carbon footprint. Also, for personal > reasons, > >>> I am not sure that I will be able to make it to Austin in July, but I > >>> realize that this is a pretty bad argument. > >>> We're happy to try to host in Paris whenever it's most convenient and > to > >>> try to help with travel for those not in Paris. 
> >>> Gaël > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From alexandre.gramfort at inria.fr Thu Dec 20 15:19:04 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Thu, 20 Dec 2018 21:19:04 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: ok for me Alex On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote: > > It'll be the least favourable week of February for me, but I can make do. > > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: >> >> Works for me! >> >> On 12/19/18 5:33 PM, Gael Varoquaux wrote: >> > I would propose the week of Feb 25th, as I heard people say that they >> > might be available at this time. It is good for many people, or should we >> > organize a doodle? >> > >> > G >> > >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: >> >> Can we please nail down dates for a sprint?
>> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: >> >>>> We can also do Paris in April / May or June if that's ok with Joel and better >> >>>> for Andreas. >> >>> Absolutely. >> >>> My thoughts here are that I want to minimize transportation, partly >> >>> because flying has a large carbon footprint. Also, for personal reasons, >> >>> I am not sure that I will be able to make it to Austin in July, but I >> >>> realize that this is a pretty bad argument. >> >>> We're happy to try to host in Paris whenever it's most convenient and to >> >>> try to help with travel for those not in Paris. >> >>> Gaël >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From pahome.chen at mirlab.org Thu Dec 20 20:57:16 2018 From: pahome.chen at mirlab.org (lampahome) Date: Fri, 21 Dec 2018 09:57:16 +0800 Subject: [scikit-learn] Does random forest work if there are very few features? Message-ID: I read the docs and know tree-based models are built using entropy or Gini impurity. When the model creates leaf nodes, it splits based on the features, right? Ex: I have 2 features A and B, and I split with A. So I have left and right nodes based on A. It should have the best shape if I create nodes based on A, right?
Now if I have 100 estimators but only two features, do I get different trees which are all based on feature A? Or do the trees based on A all have the same shape, because they were created from feature A? thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Fri Dec 21 10:00:00 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 21 Dec 2018 16:00:00 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: Ok for me. The last 3 weeks of February are fine for me. Le jeu. 20 déc. 2018 à 21:21, Alexandre Gramfort < alexandre.gramfort at inria.fr> a écrit : > ok for me > > Alex > > On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote: > > > > It'll be the least favourable week of February for me, but I can make do. > > > > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: > >> > >> Works for me! > >> > >> On 12/19/18 5:33 PM, Gael Varoquaux wrote: > >> > I would propose the week of Feb 25th, as I heard people say that they > >> > might be available at this time. It is good for many people, or > should we > >> > organize a doodle? > >> > > >> > G > >> > > >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> >> Can we please nail down dates for a sprint? > >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >> >>>> We can also do Paris in April / May or June if that's ok with Joel > and better > >> >>>> for Andreas. > >> >>> Absolutely.
> >> >>> My thoughts here are that I want to minimize transportation, partly > >> >>> because flying has a large carbon footprint. Also, for personal > reasons, > >> >>> I am not sure that I will be able to make it to Austin in July, but > I > >> >>> realize that this is a pretty bad argument. > >> >>> We're happy to try to host in Paris whenever it's most convenient > and to > >> >>> try to help with travel for those not in Paris. > >> >>> Gaël > >> >>> _______________________________________________ > >> >>> scikit-learn mailing list > >> >>> scikit-learn at python.org > >> >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> >> _______________________________________________ > >> >> scikit-learn mailing list > >> >> scikit-learn at python.org > >> >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at pm.me Sat Dec 22 10:58:21 2018 From: rth.yurchak at pm.me (Roman Yurchak) Date: Sat, 22 Dec 2018 15:58:21 +0000 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: That works for me as well. On 21/12/2018 16:00, Olivier Grisel wrote: > Ok for me. The last 3 weeks of February are fine for me. > > Le jeu. 20 déc.
2018 ? 21:21, Alexandre Gramfort > > a ?crit?: > > ok for me > > Alex > > On Thu, Dec 20, 2018 at 8:35 PM Adrin > wrote: > > > > It'll be the least favourable week of February for me, but I can > make do. > > > > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller > wrote: > >> > >> Works for me! > >> > >> On 12/19/18 5:33 PM, Gael Varoquaux wrote: > >> > I would propose? the week of Feb 25th, as I heard people say > that they > >> > might be available at this time. It is good for many people, > or should we > >> > organize a doodle? > >> > > >> > G > >> > > >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> >> Can we please nail down dates for a sprint? > >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >> >>>> We can also do Paris in April / May or June if that's ok > with Joel and better > >> >>>> for Andreas. > >> >>> Absolutely. > >> >>> My thoughts here are that I want to minimize transportation, > partly > >> >>> because flying has a large carbon footprint. Also, for > personal reasons, > >> >>> I am not sure that I will be able to make it to Austin in > July, but I > >> >>> realize that this is a pretty bad argument. > >> >>> We're happy to try to host in Paris whenever it's most > convenient and to > >> >>> try to help with travel for those not in Paris. 
From g.lemaitre58 at gmail.com  Sat Dec 22 11:27:39 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Sat, 22 Dec 2018 17:27:39 +0100
Subject: [scikit-learn] Next Sprint
In-Reply-To: 
Message-ID: <32r9p0j48h2ubjbl55ir612a.1545496059736@gmail.com>

Works for me as well.

Sent from my phone - sorry for being brief and for potential misspellings.

----- Original Message -----
From: scikit-learn at python.org
Sent: 22 December 2018 17:17
To: scikit-learn at python.org
Reply to: rth.yurchak at pm.me; scikit-learn at python.org
Cc: rth.yurchak at pm.me
Subject: Re: [scikit-learn] Next Sprint

That works for me as well.

On 21/12/2018 16:00, Olivier Grisel wrote:
> Ok for me. The last 3 weeks of February are fine for me.

From pahome.chen at mirlab.org  Mon Dec 24 22:15:42 2018
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 25 Dec 2018 11:15:42 +0800
Subject: [scikit-learn] Any way to tune the parameters better than GridSearchCV?
Message-ID: 

Take random forest as an example: if I give n_estimators from 10 to 10000
(10, 100, 1000, 10000) to grid search, I find from the result that
n_estimators=100 is the best, but I don't know whether something lower or
greater than 100 is better.

How should I decide? Brute force, or are there any tools better than
GridSearchCV? Thanks.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jbbrown at kuhp.kyoto-u.ac.jp  Mon Dec 24 22:27:01 2018
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Tue, 25 Dec 2018 12:27:01 +0900
Subject: [scikit-learn] Any way to tune the parameters better than GridSearchCV?
In-Reply-To: 
References: 
Message-ID: 

> Take random forest as an example: if I give n_estimators from 10 to 10000
> (10, 100, 1000, 10000) to grid search, I find from the result that
> n_estimators=100 is the best, but I don't know whether something lower or
> greater than 100 is better.
> How should I decide? Brute force, or are there any tools better than
> GridSearchCV?
>

A simple but nonetheless practical solution is to
(1) start with an upper bound on the number of trees you are willing to accept in the model,
(2) obtain its performance (ACC, MCC, F1, etc.) as the starting reference point,
(3) systematically lower the number of trees (log2 scale-down, fixed-size decrement, etc.),
(4) obtain the reduced forest's performance,
(5) repeat (3)-(4) until [performance(reference) - performance(current forest size)] > tolerance.

You can encapsulate that in a function which then returns the final model you obtain.
From the model object, the number of trees can be obtained.

J.B.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Mon Dec 24 23:15:01 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 24 Dec 2018 22:15:01 -0600
Subject: [scikit-learn] Any way to tune the parameters better than GridSearchCV?
In-Reply-To: 
References: 
Message-ID: <2B6DACD8-F35D-4C3B-BB4E-601F5E75BACE@sebastianraschka.com>

I would like to make a related suggestion, but instead of focusing on the
upper bound for the number of trees, rather on how to choose the lower
bound. From a theoretical perspective, it doesn't make sense to me how
fewer trees can result in a better-performing random forest model in terms
of generalization performance. If you observe better performance on the
same independent test set with fewer trees, I would say that this is
likely not a good indicator of better generalization performance; it could
be due to overfitting, train/test set resampling, and/or picking up
artifacts in the dataset.

As a general suggestion, I would choose a reasonable number of trees that
seems computationally feasible given the size of the dataset and the
number of hyperparameters to compare via model selection. Then, after
tuning, I would use the best hyperparameter setting with 10x more trees
and see if you notice any significant difference in the cross-validation
performance.
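The stepwise shrinking suggested in this thread (start from an upper bound on the number of trees, repeatedly reduce, stop once the drop versus the reference exceeds a tolerance) can be sketched in a few lines. This is only an illustrative sketch, not code from the thread: the toy dataset, the halving schedule, the `shrink_forest` name, and the `tol` threshold are all assumptions, and held-out accuracy stands in for whichever metric (ACC, MCC, F1, ...) you prefer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def shrink_forest(n_start=256, tol=0.01):
    """Halve the tree count until held-out accuracy drops more than
    `tol` below the reference score of the largest forest."""
    fit = lambda n: RandomForestClassifier(
        n_estimators=n, random_state=0).fit(X_train, y_train)
    score = lambda m: accuracy_score(y_test, m.predict(X_test))

    best = fit(n_start)          # step (1): upper-bound forest
    ref_score = score(best)      # step (2): reference performance
    n = n_start // 2
    while n >= 1:                # steps (3)-(5): shrink until the drop exceeds tol
        candidate = fit(n)
        if ref_score - score(candidate) > tol:
            break                # dropped too far; keep the previous model
        best = candidate
        n //= 2
    return best

model = shrink_forest()
print("trees kept:", len(model.estimators_))
```

Using cross-validation in place of the single train/test split above would make the stopping decision less sensitive to one particular resampling, which is essentially Sebastian's caveat in this thread.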
Next, I would use the model and fit it to the whole training set with
those best hyperparameters, and evaluate the performance on the
independent test set.

Best,
Sebastian

> On Dec 24, 2018, at 9:27 PM, Brown J.B. via scikit-learn wrote:
>
> A simple but nonetheless practical solution is to
> (1) start with an upper bound on the number of trees you are willing to accept in the model,
> (2) obtain its performance (ACC, MCC, F1, etc.) as the starting reference point,
> (3) systematically lower the number of trees (log2 scale-down, fixed-size decrement, etc.),
> (4) obtain the reduced forest's performance,
> (5) repeat (3)-(4) until [performance(reference) - performance(current forest size)] > tolerance.
>
> You can encapsulate that in a function which then returns the final model you obtain.
> From the model object, the number of trees can be obtained.
>
> J.B.

From pahome.chen at mirlab.org  Wed Dec 26 04:26:40 2018
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 26 Dec 2018 17:26:40 +0800
Subject: [scikit-learn] How to grab subsets from train sets when bootstrap=False in RF regressor?
Message-ID: 

As title.

An RF regressor builds each tree by grabbing part of the training data,
aka bootstrapping.

If I set bootstrap=False, how does the model grab the data?

The reason I'm interested is that when I set it to False, the MSE and MAE
go down, which means False is better.

-------------- next part --------------
An HTML attachment was scrubbed...
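For context on the bootstrap=False question: in scikit-learn's forests, every tree is then trained on the full training set, so the only remaining randomness comes from feature subsampling at each split (max_features). A small sketch illustrating this; the toy dataset and settings are assumptions for illustration, and with all features considered, fully grown trees on identical data come out the same.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# bootstrap=False: every tree sees the complete training set.
rf = RandomForestRegressor(n_estimators=5, bootstrap=False,
                           max_features=None, random_state=0).fit(X, y)

# With all samples and all features at every split, the fully grown
# trees are identical, so each individual tree predicts the same values.
per_tree = np.array([tree.predict(X) for tree in rf.estimators_])
print(np.allclose(per_tree, per_tree[0]))  # True
```

With bootstrap=True (the default) or max_features smaller than the number of features, the per-tree predictions would generally differ.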
URL:

From t3kcit at gmail.com  Thu Dec 27 16:33:10 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 27 Dec 2018 16:33:10 -0500
Subject: [scikit-learn] How to grab subsets from train sets when bootstrap=False in RF regressor?
In-Reply-To: 
References: 
Message-ID: 

It uses all the data.

On 12/26/18 4:26 AM, lampahome wrote:
> As title.
>
> An RF regressor builds each tree by grabbing part of the training
> data, aka bootstrapping.
>
> If I set bootstrap=False, how does the model grab the data?
>
> The reason I'm interested is that when I set it to False, the MSE and
> MAE go down, which means False is better.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com  Thu Dec 27 17:59:59 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 27 Dec 2018 17:59:59 -0500
Subject: [scikit-learn] Draft of a Scikit-learn governance document
Message-ID: <1a1c9e5f-389c-f5f9-1552-a71a5513ff96@gmail.com>

Hi all.

I just posted a proposal for a scikit-learn governance document as a PR:
https://github.com/scikit-learn/scikit-learn/pull/12878

The core devs have already discussed this to some degree, but I think it
would be great to involve the greater community in finalizing this.
Any feedback is welcome.

Cheers,
Andy

From joel.nothman at gmail.com  Mon Dec 31 20:26:41 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 1 Jan 2019 12:26:41 +1100
Subject: [scikit-learn] ANN: Scikit-learn 0.20.2 released
Message-ID: 

A bug fix release of scikit-learn, version 0.20.2, was released a couple of
weeks ago. It is not yet on the Conda default channel, but should be
available on PyPI and conda-forge. Thank you to all who contributed.
As well as the changes listed at
https://scikit-learn.org/0.20/whats_new.html#version-0-20-2 and
documentation improvements, we also corrected an error in packaging the
source distribution for the previous release: we have made sure to use the
latest Cython this time.

We still anticipate that there will be a further release in the 0.20
series to fix regressions from 0.19 to 0.20.

Happy new year, and happy learning!

The scikit-learn developer team

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From qinhanmin2005 at sina.com  Mon Dec 31 22:16:50 2018
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Tue, 01 Jan 2019 11:16:50 +0800
Subject: [scikit-learn] ANN: Scikit-learn 0.20.2 released
Message-ID: <20190101031651.1066E4140092@webmail.sinamail.sina.com.cn>

0.20.2 is now available on the conda default channel. Happy new year to
everyone!

The scikit-learn developer team

----- Original Message -----
From: Joel Nothman
To: Scikit-learn user and developer mailing list
Subject: [scikit-learn] ANN: Scikit-learn 0.20.2 released
Date: 2019-01-01 09:28

A bug fix release of scikit-learn, version 0.20.2, was released a couple of
weeks ago. It is not yet on the Conda default channel, but should be
available on PyPI and conda-forge. Thank you to all who contributed.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: