From ay.j at hotmail.fr Thu Jun 1 13:09:19 2017
From: ay.j at hotmail.fr (Aymen J)
Date: Thu, 1 Jun 2017 17:09:19 +0000
Subject: [scikit-learn] Ipython Jupyter Kernel Dies when I fit an SGDClassifier
Message-ID:

Hey Guys,

So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).

Here is my model:

model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
                      l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
                      learning_rate='constant', eta0=1.0)

When I run the model.fit function, the program runs for about 5 minutes, and I receive the message "the kernel has died" from Jupyter.

Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?

Thanks in advance for your help!

From ivanvallesperez at gmail.com Fri Jun 2 06:50:35 2017
From: ivanvallesperez at gmail.com (Iván Vallés Pérez)
Date: Fri, 02 Jun 2017 10:50:35 +0000
Subject: [scikit-learn] Ipython Jupyter Kernel Dies when I fit an SGDClassifier
In-Reply-To: References: Message-ID:

Are you monitoring your RAM consumption? I would say that it is the cause of the majority of the kernel crashes.

El vie, 2 jun 2017 a las 12:45, Aymen J escribió:
> Hey Guys,
>
> So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).
>
> Here is my model:
>
> model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
>                       l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
>                       learning_rate='constant', eta0=1.0)
>
> When I run the model.fit function, the program runs for about 5 minutes, and I receive the message "the kernel has died" from Jupyter.
>
> Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?
>
> Thanks in advance for your help!
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From se.raschka at gmail.com Fri Jun 2 13:30:30 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Fri, 2 Jun 2017 13:30:30 -0400
Subject: [scikit-learn] Ipython Jupyter Kernel Dies when I fit an SGDClassifier
In-Reply-To: References: Message-ID: <1DB52756-97A2-4629-B163-36BA3E51C034@gmail.com>

I also think that this could likely be a memory-related issue. I just ran the following snippet in a Jupyter notebook:

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
                      l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
                      learning_rate='constant', eta0=1.0)

X = np.random.random((1000000, 1000))
y = np.zeros(1000000)
y[:1000] = 1

model.fit(X, y)

The dataset takes approx. 8 GB, but the model fitting is consuming ~16 GB -- probably due to making a copy of the X array in the code. The notebook didn't crash, but I think on machines with smaller RAM this could be an issue. One workaround you could try is to fit the model iteratively using partial_fit, for example 1000 samples at a time or so:

indices = np.arange(y.shape[0])
batch_size = 1000

for start_idx in range(0, indices.shape[0] - batch_size + 1, batch_size):
    index_slice = indices[start_idx:start_idx + batch_size]
    model.partial_fit(X[index_slice], y[index_slice], classes=[0, 1])

Best,
Sebastian

> On Jun 2, 2017, at 6:50 AM, Iván Vallés Pérez wrote:
>
> Are you monitoring your RAM consumption? I would say that it is the cause of the majority of the kernel crashes
> El vie, 2 jun 2017 a las 12:45, Aymen J escribió:
> Hey Guys,
>
> So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).
> Here is my model:
>
> model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
>                       l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
>                       learning_rate='constant', eta0=1.0)
>
> When I run the model.fit function, the program runs for about 5 minutes, and I receive the message "the kernel has died" from Jupyter.
>
> Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?
>
> Thanks in advance for your help!
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From stuart at stuartreynolds.net Fri Jun 2 13:39:48 2017
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Fri, 2 Jun 2017 10:39:48 -0700
Subject: [scikit-learn] Ipython Jupyter Kernel Dies when I fit an SGDClassifier
In-Reply-To: <1DB52756-97A2-4629-B163-36BA3E51C034@gmail.com> References: <1DB52756-97A2-4629-B163-36BA3E51C034@gmail.com> Message-ID:

Hmmm... is it possible to place your original data into a memmap? (Perhaps that will clear out 8 GB, depending on SGDClassifier internals?)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas

- Stuart

On Fri, Jun 2, 2017 at 10:30 AM, Sebastian Raschka wrote:
> I also think that this could likely be a memory-related issue.
> I just ran the following snippet in a Jupyter notebook:
>
> import numpy as np
> from sklearn.linear_model import SGDClassifier
>
> model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
>                       l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
>                       learning_rate='constant', eta0=1.0)
>
> X = np.random.random((1000000, 1000))
> y = np.zeros(1000000)
> y[:1000] = 1
>
> model.fit(X, y)
>
> The dataset takes approx. 8 GB, but the model fitting is consuming ~16 GB -- probably due to making a copy of the X array in the code. The notebook didn't crash, but I think on machines with smaller RAM this could be an issue. One workaround you could try is to fit the model iteratively using partial_fit, for example 1000 samples at a time or so:
>
> indices = np.arange(y.shape[0])
> batch_size = 1000
>
> for start_idx in range(0, indices.shape[0] - batch_size + 1, batch_size):
>     index_slice = indices[start_idx:start_idx + batch_size]
>     model.partial_fit(X[index_slice], y[index_slice], classes=[0, 1])
>
> Best,
> Sebastian
>
>> On Jun 2, 2017, at 6:50 AM, Iván Vallés Pérez wrote:
>>
>> Are you monitoring your RAM consumption? I would say that it is the cause of the majority of the kernel crashes
>> El vie, 2 jun 2017 a las 12:45, Aymen J escribió:
>> Hey Guys,
>>
>> So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).
>>
>> Here is my model:
>>
>> model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
>>                       l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
>>                       learning_rate='constant', eta0=1.0)
>>
>> When I run the model.fit function, the program runs for about 5 minutes, and I receive the message "the kernel has died" from Jupyter.
>>
>> Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?
>>
>> Thanks in advance for your help!
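The memmap idea suggested above can be sketched as follows — a hypothetical toy example, not the original poster's data: the sizes, the temporary file path, and the default SGDClassifier parameters are all illustrative, and it combines the memmap with the partial_fit workaround discussed earlier in the thread:

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy sizes; the file path and shapes are illustrative only.
n_samples, n_features = 5000, 100
mm_path = os.path.join(tempfile.mkdtemp(), "X_train.dat")

# Write the data to a disk-backed array once...
X_disk = np.memmap(mm_path, dtype=np.float64, mode="w+",
                   shape=(n_samples, n_features))
X_disk[:] = np.random.RandomState(0).random_sample((n_samples, n_features))
X_disk.flush()
del X_disk  # drop the writable view

y = (np.random.RandomState(1).random_sample(n_samples) > 0.5).astype(int)

# ...then reopen it read-only and fit in slices, so only the pages that are
# actually touched need to be resident in RAM at any one time.
X = np.memmap(mm_path, dtype=np.float64, mode="r",
              shape=(n_samples, n_features))
clf = SGDClassifier(random_state=0)
for start in range(0, n_samples, 1000):
    batch = slice(start, start + 1000)
    clf.partial_fit(X[batch], y[batch], classes=[0, 1])

acc = clf.score(X, y)
print("training accuracy: %.3f" % acc)
```

Default SGDClassifier parameters are used here only to keep the sketch version-agnostic (in recent scikit-learn releases the 'log' loss name used in this thread has been replaced by 'log_loss').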
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From ay.j at hotmail.fr Fri Jun 2 07:01:21 2017
From: ay.j at hotmail.fr (Aymen J)
Date: Fri, 2 Jun 2017 11:01:21 +0000
Subject: [scikit-learn] Ipython Jupyter Kernel Dies when I fit an SGDClassifier
In-Reply-To: References: Message-ID:

Thanks for the answer. Not really. How can I do that?

Sent from my iPhone

On Jun 2, 2017, at 12:51 PM, Iván Vallés Pérez wrote:

Are you monitoring your RAM consumption? I would say that it is the cause of the majority of the kernel crashes

El vie, 2 jun 2017 a las 12:45, Aymen J escribió:

Hey Guys,

So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).

Here is my model:

model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
                      l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
                      learning_rate='constant', eta0=1.0)

When I run the model.fit function, the program runs for about 5 minutes, and I receive the message "the kernel has died" from Jupyter.

Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?

Thanks in advance for your help!
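On the "how can I do that?" question above: besides `top` or the OS task manager, memory consumption can be checked from inside the notebook itself. A minimal sketch using the standard-library resource module (Unix only; Linux reports ru_maxrss in KiB), together with a back-of-envelope size estimate for the dataset described in this thread:

```python
import resource

import numpy as np

def peak_rss_mb():
    # Peak resident set size of the current process (Linux reports KiB)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

before = peak_rss_mb()
X = np.random.random((10000, 3600))  # a scaled-down stand-in for the real data
after = peak_rss_mb()
print("peak RSS before: %.0f MB, after allocating X: %.0f MB" % (before, after))

# Back-of-envelope estimate for the dataset discussed in this thread:
# 900,000 samples x 3,600 features x 8 bytes (float64)
est_gb = 900000 * 3600 * 8 / 1e9
print("estimated dense size: %.2f GB" % est_gb)  # ~25.9 GB before any copies
```

With the full dense float64 matrix at roughly 26 GB before fit() makes any copy, a kernel killed by the out-of-memory handler is consistent with what is reported in this thread.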
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From sean.violante at gmail.com Sat Jun 3 15:58:10 2017
From: sean.violante at gmail.com (Sean Violante)
Date: Sat, 3 Jun 2017 21:58:10 +0200
Subject: [scikit-learn] Ipython Jupyter Kernel Dies when I fit an SGDClassifier
In-Reply-To: References: <1DB52756-97A2-4629-B163-36BA3E51C034@gmail.com> Message-ID:

Have you used sparse arrays?

On Fri, Jun 2, 2017 at 7:39 PM, Stuart Reynolds wrote:
> Hmmm... is it possible to place your original data into a memmap? (Perhaps that will clear out 8 GB, depending on SGDClassifier internals?)
>
> https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
> https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas
>
> - Stuart
>
> On Fri, Jun 2, 2017 at 10:30 AM, Sebastian Raschka wrote:
> > I also think that this could likely be a memory-related issue. I just ran the following snippet in a Jupyter notebook:
> >
> > import numpy as np
> > from sklearn.linear_model import SGDClassifier
> >
> > model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
> >                       l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
> >                       learning_rate='constant', eta0=1.0)
> >
> > X = np.random.random((1000000, 1000))
> > y = np.zeros(1000000)
> > y[:1000] = 1
> >
> > model.fit(X, y)
> >
> > The dataset takes approx. 8 GB, but the model fitting is consuming ~16 GB -- probably due to making a copy of the X array in the code. The notebook didn't crash, but I think on machines with smaller RAM this could be an issue. One workaround you could try is to fit the model iteratively using partial_fit.
> > For example, 1000 samples at a time or so:
> >
> > indices = np.arange(y.shape[0])
> > batch_size = 1000
> >
> > for start_idx in range(0, indices.shape[0] - batch_size + 1, batch_size):
> >     index_slice = indices[start_idx:start_idx + batch_size]
> >     model.partial_fit(X[index_slice], y[index_slice], classes=[0, 1])
> >
> > Best,
> > Sebastian
> >
> >> On Jun 2, 2017, at 6:50 AM, Iván Vallés Pérez <ivanvallesperez at gmail.com> wrote:
> >>
> >> Are you monitoring your RAM consumption? I would say that it is the cause of the majority of the kernel crashes
> >> El vie, 2 jun 2017 a las 12:45, Aymen J escribió:
> >> Hey Guys,
> >>
> >> So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).
> >>
> >> Here is my model:
> >>
> >> model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
> >>                       l1_ratio=0.0, fit_intercept=False, n_iter=1, shuffle=False,
> >>                       learning_rate='constant', eta0=1.0)
> >>
> >> When I run the model.fit function, the program runs for about 5 minutes, and I receive the message "the kernel has died" from Jupyter.
> >>
> >> Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?
> >>
> >> Thanks in advance for your help!
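On the sparse-array question: if the 3,600 high-cardinality features are one-hot encoded, most entries are zero, and storing X as a scipy.sparse CSR matrix — which SGDClassifier accepts directly — can cut memory by an order of magnitude or more. A minimal sketch with made-up toy sizes and density:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Toy stand-in: 10,000 samples, 3,600 mostly-zero features (~1% density).
X_sparse = sparse.random(10000, 3600, density=0.01, format="csr",
                         random_state=rng)
y = rng.randint(0, 2, 10000)

dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * 8  # float64, stored densely
sparse_bytes = (X_sparse.data.nbytes + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print("dense: %.0f MB, sparse: %.1f MB" % (dense_bytes / 1e6, sparse_bytes / 1e6))

# SGDClassifier accepts CSR input directly, so no densification is needed.
clf = SGDClassifier(random_state=0)
clf.fit(X_sparse, y)
print("coefficient shape: %s" % (clf.coef_.shape,))
```

The actual saving depends on the real density of the data; for genuinely dense features a sparse format would not help.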
> >> > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rain.vagel at gmail.com Sun Jun 4 17:04:05 2017 From: rain.vagel at gmail.com (Rain Vagel) Date: Mon, 5 Jun 2017 00:04:05 +0300 Subject: [scikit-learn] Cross-validation & cross-testing Message-ID: <6CB4B416-5625-41A8-BB1C-3E89DB968088@gmail.com> Hey, I am a bachelor?s student and for my thesis I implemented a cross-testing function in a scikit-learn compatible way and published it on Github. The paper on which I based my own thesis can be found here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0161788 . My project can be found here: https://github.com/RainVagel/cross-val-cross-test . Our original plan was to try and get the algorithm into scikit-learn, but it doesn?t meet the requirements yet. So instead we thought about maybe having it listed in the ?Related Projects? page. Is it possible for somebody to take a look and give any feedback? Sincerely, Rain -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Sun Jun 4 18:29:04 2017 From: tmrsg11 at gmail.com (C W) Date: Sun, 4 Jun 2017 18:29:04 -0400 Subject: [scikit-learn] How to best understand scikit-learn and know its modules and methods? 
Message-ID:

Dear scikit learn list,

I am new to scikit-learn. I am getting confused about LinearRegression. For example,

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

boston = load_boston()
X = boston.data
y = boston.target
model1 = LinearRegression()
model1.fit(X, y)
print(model1.coef_)

I got a few questions:
1) When I do model1.fit(X, y), don't I have to save it? Does the object model1 automatically get trained/updated? Since I don't see any output, how do I know what has been done to model1?
2) Is there a command to see what's masked under sklearn, like sklearn.datasets, sklearn.linear_model, and all of it?
3) Why do we need load_boston() to load the boston data? I thought we just imported it, so it should be ready to use.

Thank you very much!

Mike

From g.lemaitre58 at gmail.com Sun Jun 4 18:40:10 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Mon, 05 Jun 2017 00:40:10 +0200
Subject: [scikit-learn] How to best understand scikit-learn and know its modules and methods?
In-Reply-To: References: Message-ID: <20170604224010.4878413.74178.32845@gmail.com>

An HTML attachment was scrubbed...
URL:

From tmrsg11 at gmail.com Sun Jun 4 19:06:28 2017
From: tmrsg11 at gmail.com (C W)
Date: Sun, 4 Jun 2017 19:06:28 -0400
Subject: [scikit-learn] How to best understand scikit-learn and know its modules and methods?
In-Reply-To: <20170604224010.4878413.74178.32845@gmail.com> References: <20170604224010.4878413.74178.32845@gmail.com> Message-ID:

Yes, they make a lot of sense. Thanks!

I wanted to ask a follow-up:

> LinearRegression().fit(X, y)

When I do this, where is everything saved? Or does it disappear after I run it?

Thank you!

On Sun, Jun 4, 2017 at 6:40 PM, Guillaume Lemaitre wrote:
> Hope it helps.
I answered in the original message
>
> G
> From: C W
> Sent: Monday, 5 June 2017 00:31
> To: scikit-learn at python.org
> Reply To: Scikit-learn user and developer mailing list
> Subject: [scikit-learn] How to best understand scikit-learn and know its modules and methods?
>
> Dear scikit learn list,
>
> I am new to scikit-learn. I am getting confused about LinearRegression. For example,
>
> from sklearn.datasets import load_boston
> from sklearn.linear_model import LinearRegression
>
> boston = load_boston()
> X = boston.data
> y = boston.target
> model1 = LinearRegression()
> model1.fit(X, y)
> print(model1.coef_)
>
> I got a few questions:
> 1) When I do model1.fit(X, y), don't I have to save it? Does the object model1 automatically get trained/updated? Since I don't see any output, how do I know what has been done to model1?
>
> The model has been fitted (trained in place). model1 will contain all the info learnt, directly. In addition, the output will be a fitted model1 because fit returns self. Normally, model1.fit(X, y) will print LinearRegression(...)
>
> 2) Is there a command to see what's masked under sklearn, like sklearn.datasets, sklearn.linear_model, and all of it?
>
> You can check the documentation API. I think that this is the best user-friendly thing that you can start with.
>
> 3) Why do we need load_boston() to load the boston data? I thought we just imported it, so it should be ready to use.
>
> load_boston() is a helper function which will load the data. Importing load_boston will import the function, not the data. Calling the imported function will load the data.
>
> Thank you very much!
>
> Mike
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Sun Jun 4 21:52:22 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 5 Jun 2017 11:52:22 +1000
Subject: [scikit-learn] Cross-validation & cross-testing
In-Reply-To: <6CB4B416-5625-41A8-BB1C-3E89DB968088@gmail.com> References: <6CB4B416-5625-41A8-BB1C-3E89DB968088@gmail.com> Message-ID:

Hi Rain,

I would suggest that you start by documenting what your code is meant to do (the structure of the Korjus et al paper makes it pretty difficult to even determine what this technique is, for you then not to describe it in your own words in your repository), testing it with diverse inputs and ensuring that it is correct. At a glance I can see at least two sources of bugs, and some API design choices which I think could be improved.

Cheers,

Joel

On 5 June 2017 at 07:04, Rain Vagel wrote:
> Hey,
>
> I am a bachelor's student and for my thesis I implemented a cross-testing function in a scikit-learn compatible way and published it on GitHub. The paper on which I based my own thesis can be found here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0161788
>
> My project can be found here: https://github.com/RainVagel/cross-val-cross-test
>
> Our original plan was to try and get the algorithm into scikit-learn, but it doesn't meet the requirements yet. So instead we thought about maybe having it listed in the "Related Projects" page. Is it possible for somebody to take a look and give any feedback?
>
> Sincerely,
> Rain
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Sun Jun 4 21:53:43 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 5 Jun 2017 11:53:43 +1000
Subject: [scikit-learn] Cross-validation & cross-testing
In-Reply-To: References: <6CB4B416-5625-41A8-BB1C-3E89DB968088@gmail.com> Message-ID:

And when I say testing it, I mean writing tests that live with the code so that they can be re-executed, and so that someone else can see what your tests assert about your code's correctness.

On 5 June 2017 at 11:52, Joel Nothman wrote:
> Hi Rain,
>
> I would suggest that you start by documenting what your code is meant to do (the structure of the Korjus et al paper makes it pretty difficult to even determine what this technique is, for you then not to describe it in your own words in your repository), testing it with diverse inputs and ensuring that it is correct. At a glance I can see at least two sources of bugs, and some API design choices which I think could be improved.
>
> Cheers,
>
> Joel
>
> On 5 June 2017 at 07:04, Rain Vagel wrote:
> Hey,
>
> I am a bachelor's student and for my thesis I implemented a cross-testing function in a scikit-learn compatible way and published it on GitHub. The paper on which I based my own thesis can be found here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0161788
>
> My project can be found here: https://github.com/RainVagel/cross-val-cross-test
>
> Our original plan was to try and get the algorithm into scikit-learn, but it doesn't meet the requirements yet. So instead we thought about maybe having it listed in the "Related Projects" page. Is it possible for somebody to take a look and give any feedback?
> Sincerely,
> Rain
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From se.raschka at gmail.com Sun Jun 4 23:27:16 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sun, 4 Jun 2017 23:27:16 -0400
Subject: [scikit-learn] Cross-validation & cross-testing
In-Reply-To: References: <6CB4B416-5625-41A8-BB1C-3E89DB968088@gmail.com> Message-ID:

> Is it possible for somebody to take a look and give any feedback?

Just looked over your repo and would have some feedback:

Definitely cite the original research paper that your implementation is based on. Right now it just says "The cross-validation & cross-testing method was developed by Korjus et al." (year, journal, title, ... are missing).

Like Joel mentioned, I'd add unit tests, and also consider CI services like Travis to check that the code indeed works (produces the same results) for package versions newer than the ones you listed, since you use ">=".

Maybe a good, explanatory figure would help -- often, a good figure can make things much more clear and intuitive for a user. For new algorithms, it is also helpful to explain them in a procedural way using a numeric list of steps. In addition to describing the package, also consider stating the problem this approach is going to address.

Just a few general comments on the paper (which, I have to admit, I only skimmed). Not sure what to think of this; it might be an interesting idea, but showing empirical results on only 2 datasets and a simulated one does not convince me that this is useful in practice, yet. Also, a discussion/analysis of bias and variance seems to be missing from that paper.
Another thing is that I think in practice one would also consider LOOCV or bootstrap approaches for "very" small datasets, which is not even mentioned in this paper. While I think there might be some interesting idea here, I'd say there needs to be additional research to make a judgement on whether this approach should be used in practice or not -- I would say it's a bit too early to include something like this in scikit-learn?

Best,
Sebastian

> On Jun 4, 2017, at 9:53 PM, Joel Nothman wrote:
>
> And when I say testing it, I mean writing tests that live with the code so that they can be re-executed, and so that someone else can see what your tests assert about your code's correctness.
>
> On 5 June 2017 at 11:52, Joel Nothman wrote:
> Hi Rain,
>
> I would suggest that you start by documenting what your code is meant to do (the structure of the Korjus et al paper makes it pretty difficult to even determine what this technique is, for you then not to describe it in your own words in your repository), testing it with diverse inputs and ensuring that it is correct. At a glance I can see at least two sources of bugs, and some API design choices which I think could be improved.
>
> Cheers,
>
> Joel
>
> On 5 June 2017 at 07:04, Rain Vagel wrote:
> Hey,
>
> I am a bachelor's student and for my thesis I implemented a cross-testing function in a scikit-learn compatible way and published it on GitHub. The paper on which I based my own thesis can be found here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0161788
>
> My project can be found here: https://github.com/RainVagel/cross-val-cross-test
>
> Our original plan was to try and get the algorithm into scikit-learn, but it doesn't meet the requirements yet. So instead we thought about maybe having it listed in the "Related Projects" page. Is it possible for somebody to take a look and give any feedback?
> Sincerely,
> Rain
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From jmschreiber91 at gmail.com Mon Jun 5 00:49:06 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Sun, 4 Jun 2017 21:49:06 -0700
Subject: [scikit-learn] How to best understand scikit-learn and know its modules and methods?
In-Reply-To: References: <20170604224010.4878413.74178.32845@gmail.com> Message-ID:

Everything will disappear if you don't save it. However, if you do ```clf = LinearRegression().fit(X, y)``` then the model is saved in the variable `clf`.

On Sun, Jun 4, 2017 at 4:06 PM, C W wrote:
> Yes, they make a lot of sense. Thanks!
>
> I wanted to ask a follow-up:
>
> > LinearRegression().fit(X, y)
>
> When I do this, where is everything saved? Or does it disappear after I run it?
>
> Thank you!
>
> On Sun, Jun 4, 2017 at 6:40 PM, Guillaume Lemaitre wrote:
>> Hope it helps. I answered in the original message
>>
>> G
>> From: C W
>> Sent: Monday, 5 June 2017 00:31
>> To: scikit-learn at python.org
>> Reply To: Scikit-learn user and developer mailing list
>> Subject: [scikit-learn] How to best understand scikit-learn and know its modules and methods?
>>
>> Dear scikit learn list,
>>
>> I am new to scikit-learn. I am getting confused about LinearRegression. For example,
>>
>> from sklearn.datasets import load_boston
>> from sklearn.linear_model import LinearRegression
>>
>> boston = load_boston()
>> X = boston.data
>> y = boston.target
>> model1 = LinearRegression()
>> model1.fit(X, y)
>> print(model1.coef_)
>>
>> I got a few questions:
>> 1) When I do model1.fit(X, y), don't I have to save it? Does the object model1 automatically get trained/updated?
>> Since I don't see any output, how do I know what has been done to model1?
>>
>> The model has been fitted (trained in place). model1 will contain all the info learnt, directly. In addition, the output will be a fitted model1 because fit returns self. Normally, model1.fit(X, y) will print LinearRegression(...)
>>
>> 2) Is there a command to see what's masked under sklearn, like sklearn.datasets, sklearn.linear_model, and all of it?
>>
>> You can check the documentation API. I think that this is the best user-friendly thing that you can start with.
>>
>> 3) Why do we need load_boston() to load the boston data? I thought we just imported it, so it should be ready to use.
>>
>> load_boston() is a helper function which will load the data. Importing load_boston will import the function, not the data. Calling the imported function will load the data.
>>
>> Thank you very much!
>>
>> Mike
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From jbbrown at kuhp.kyoto-u.ac.jp Mon Jun 5 10:46:27 2017
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Mon, 5 Jun 2017 23:46:27 +0900
Subject: [scikit-learn] Random Forest max_features and bootstrap construction parameters interpretation
Message-ID:

Dear community,

This is a question regarding how to interpret the documentation and semantics of the random forest constructors.
In forest.py (of version 0.17, which I am still using), the documentation regarding the number of features to consider states, on lines 742-745 of the source code, that the search may effectively inspect more than `max_features` when determining the features to pick from in order to split a node. It also states that it is tree specific.

Am I correct in:

Interpretation #1 - For bootstrap=True, sampling with replacement occurs for the number of training instances available, meaning that the subsample presented to a particular tree will have some probability of containing overlaps and therefore not the full input training set, but for bootstrap=False, the entire dataset will be presented to each tree?

Interpretation #2 - Particularly, with the way I interpret the documentation stating that "The sub-sample size is always the same as the original input sample size...", it seems to me that bootstrap=False then provides the entire training dataset to each decision tree, and it is a matter of which feature was randomly selected first from the features given that determines what the tree will become. That would suggest that, if bootstrap=False, and if the number of trees is high but the feature dimensionality is very low, then there is a high possibility that multiple copies of the same tree will emerge from the forest.

Interpretation #3 - The feature subset is not subsampled per tree, but rather all features are presented for the subsampled training data provided to a tree? For example, if the dimensionality is 400 on a 6000-input training dataset that has randomly been subsampled (with bootstrap=True) to yield 4700 unique training samples, then the tree builder will consider all 400 dimensions/features with respect to the 4700 samples, picking at most `max_features` number of features (out of 400) for building splits in the tree? So by default (sqrt/auto), there would be at most 20 splits in the tree?
Confirmations, denials, and corrections to my interpretations are _highly_ welcome. As always, my great thanks to the community.

With kind regards,
J.B. Brown
Kyoto University Graduate School of Medicine

From jmschreiber91 at gmail.com Mon Jun 5 13:54:57 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Mon, 5 Jun 2017 10:54:57 -0700
Subject: [scikit-learn] Random Forest max_features and bootstrap construction parameters interpretation
In-Reply-To: References: Message-ID:

Howdy

When doing bootstrapping, n samples are selected from the dataset WITH replacement, where n is the number of samples in the dataset. This leads to situations where some samples have a weight > 1 and others have a weight of 0. This is done separately for each tree.

When selecting the number of features, this should be considered more like `max_informative_features`. Essentially, if a tree considers splitting on a feature that is constant, that won't count against the `max_features` threshold that is set. This helps guard against situations where many uninformative trees are built because the dataset is full of uninformative features. You can see this in the code here: https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361. This is done on a per-split basis, meaning that a tree can have more than `max_features` number of features considered.

In your example, it is not that there would be at most 20 splits in a tree, it is that at each split only 20 informative features would be considered. You can split on a feature multiple times (consider the example where you have one feature and x < 0 is class 0, 0 <= x <= 10 is class 1, and x > 10 is class 0 again).

Let me know if you have any other questions!

On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B.
wrote: > Dear community, > > This is a question regarding how to interpret the documentation and > semantics of the random forest constructors. > > In forest.py (of version 0.17 which I am still using), the documentation > regarding the number of features to consider states on lines 742-745 of the > source code that the search may effectively inspect more than > `max_features` when determining the features to pick from in order to split > a node. > It also states that it is tree specific. > > Am I correct in: > > Interpretation #1 - For bootstrap=True, sampling with replacement occurs > for the number of training instances available, meaning that the subsample > presented to a particular tree will have some probability of containing > overlaps and therefore not the full input training set, but for > bootstrap=False, the entire dataset will be presented to each tree? > > Interpretation #2 - Particularly, with the way I interpret the > documentation stating that "The sub-sample size is always the same as the > original input sample size...", it seems to me that bootstrap=False then > provides the entire training dataset to each decision tree, and it is a > matter of which feature was randomly selected first from the features given > that determines what the tree will become. > That would suggest that, if bootstrap=False, and if the number of trees is > high but the feature dimensionality is very low, then there is a high > possibility that multiple copies of the same tree will emerge from the > forest. > > Interpretation #3 - the feature subset is not subsampled per tree, but > rather all features are presented for the subsampled training data provided > to a tree ? 
> For example, if the dimensionality is 400 on a 6000-input
> training dataset that has randomly been subsampled (with bootstrap=True) to
> yield 4700 unique training samples, then the tree builder will consider all
> 400 dimensions/features with respect to the 4700 samples, picking at most
> `max_features` number of features (out of 400) for building splits in the
> tree? So by default (sqrt/auto), there would be at most 20 splits in the
> tree?
>
> Confirmations, denials, and corrections to my interpretations are _highly_
> welcome.
>
> As always, my great thanks to the community.
>
> With kind regards,
> J.B. Brown
> Kyoto University Graduate School of Medicine
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jbbrown at kuhp.kyoto-u.ac.jp Tue Jun 6 01:02:21 2017
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Tue, 6 Jun 2017 14:02:21 +0900
Subject: [scikit-learn] Random Forest max_features and boostrap construction parameters interpretation
In-Reply-To: References: Message-ID: 

Dear Jacob,

Thank you for this clarification. It is a great help in interpreting the (good) results that we are obtaining for computational chemogenomics, and it also helps in deciding directions of future studies.

Perhaps, then, the random forest documentation (description web page) could be updated to reflect our discussion, as it might help others who have the same questions of interpretation.

Perhaps we can add the following (with notation symbols corrected to match sklearn standards):
----------
In general, for a modeling problem with N training instances each having F features, a random forest of T trees operates by building T decision trees such that each tree is provided a subsampling of the N instances and the F features for those subsampled instances.
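We might also include a short usage sketch tying these symbols to the constructor arguments (the values below are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: N = 6000 instances, F = 400 features.
X, y = make_classification(n_samples=6000, n_features=400, random_state=0)

# T = 10 trees; each tree receives a bootstrap resample of the N rows,
# and sqrt(F) = 20 features are considered at each split.
forest = RandomForestClassifier(n_estimators=10,      # T
                                bootstrap=True,       # resample the N instances per tree
                                max_features='sqrt',  # sqrt(F) = 20 per split
                                random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))              # 10 fitted trees
print(forest.estimators_[0].max_features_)  # 20 features considered per split
```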
When bootstrapping is applied, the instance subsampling can potentially choose the same instance multiple times, in which case such an instance will carry elevated weight. When bootstrapping is not applied, the entire training set is provided to the tree-building algorithm.

Each tree is built by considering a maximum specified number of informative features at each decision node, such that features with no variance are excluded from the features to consider for a split and do not count toward the number of informative features. Splitting on the informative features can occur as many times as necessary, unless a maximum depth is specified in the constructor. Note that an informative feature can be re-applied to form a decision criterion at more than one node in the decision tree.
----------

Adjustments welcome.

Many thanks again!
J.B.

2017-06-06 2:54 GMT+09:00 Jacob Schreiber :

> Howdy
>
> When doing bootstrapping, n samples are selected from the dataset WITH
> replacement, where n is the number of samples in the dataset. This leads to
> situations where some samples have a weight > 1 and others have a weight of
> 0. This is done separately for each tree.
>
> When selecting the number of features, this should be considered more like
> `max_informative_features`. Essentially, if a tree considers splitting a
> feature that is constant, that won't count against the `max_features`
> threshold that is set. This helps guard against situations where many
> uninformative trees are built because the dataset is full of uninformative
> features. You can see this in the code here:
> (https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361).
> This is done on a -per split basis-, meaning that a tree can have more than
> `max_features` number of features considered.
> > In your example, it is not that there would be at most 20 splits in a > tree, it is that at each split only 20 informative features would be > considered. You can split on a feature multiple times (consider the example > where you have one features and x < 0 is class 0, 0 <= x <= 10 is class 1, > and x > 10 is class 0 again). > > Let me know if you have any other questions! > > On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. > wrote: > >> Dear community, >> >> This is a question regarding how to interpret the documentation and >> semantics of the random forest constructors. >> >> In forest.py (of version 0.17 which I am still using), the documentation >> regarding the number of features to consider states on lines 742-745 of the >> source code that the search may effectively inspect more than >> `max_features` when determining the features to pick from in order to split >> a node. >> It also states that it is tree specific. >> >> Am I correct in: >> >> Interpretation #1 - For bootstrap=True, sampling with replacement occurs >> for the number of training instances available, meaning that the subsample >> presented to a particular tree will have some probability of containing >> overlaps and therefore not the full input training set, but for >> bootstrap=False, the entire dataset will be presented to each tree? >> >> Interpretation #2 - Particularly, with the way I interpret the >> documentation stating that "The sub-sample size is always the same as the >> original input sample size...", it seems to me that bootstrap=False then >> provides the entire training dataset to each decision tree, and it is a >> matter of which feature was randomly selected first from the features given >> that determines what the tree will become. >> That would suggest that, if bootstrap=False, and if the number of trees >> is high but the feature dimensionality is very low, then there is a high >> possibility that multiple copies of the same tree will emerge from the >> forest. 
>> >> Interpretation #3 - the feature subset is not subsampled per tree, but >> rather all features are presented for the subsampled training data provided >> to a tree ? For example, if the dimensionality is 400 on a 6000-input >> training dataset that has randomly been subsampled (with bootstrap=True) to >> yield 4700 unique training samples, then the tree builder will consider all >> 400 dimensions/features with respect to the 4700 samples, picking at most >> `max_features` number of features (out of 400) for building splits in the >> tree? So by default (sqrt/auto), there would be at most 20 splits in the >> tree? >> >> Confirmations, denials, and corrections to my interpretations are >> _highly_ welcome. >> >> As always, my great thanks to the community. >> >> With kind regards, >> J.B. Brown >> Kyoto University Graduate School of Medicine >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Tue Jun 6 01:26:15 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 5 Jun 2017 22:26:15 -0700 Subject: [scikit-learn] Random Forest max_features and boostrap construction parameters interpretation In-Reply-To: References: Message-ID: Howdy This documentation seems to be split between the RandomForestClassifier documentation that discusses sampling with replacement and the Ensemble documentation that discusses the splits. I agree that it could be made more explicit. If you want to build off of those to make sure that both discuss both features, I would be happy to review it. Jacob On Mon, Jun 5, 2017 at 10:02 PM, Brown J.B. 
wrote: > Dear Jacob, > > Thank you for this clarification. It is a great help in interpreting the > (good) results that we are obtaining for computational chemogenomics, and > also help in deciding directions of future studies. > > Perhaps then, the random forest documentation (description web page) could > be updated to reflect our discussion, in that it might help others who have > the same questions of interpretation. > > Perhaps, we can add the following (with notation symbols corrected to > match sklearn standards): > ---------- > In general, for a modeling problem with N training instances each having F > features, a random forest of T trees operates by building T decision trees > such that each tree is provided a subsampling of N instances and the F > features for those subsampled instances. > > When bootstrapping is applied, the instance subsampling can potentially > choose the same instance multiple times, in which such an instance will > have elevated weighting. > When bootstrapping is not applied, the entire training set is provided to > the tree-building algorithm. > > Each tree is built by considering a maximum specified number of > informative features at each decision node, such that features with no > variance are excluded from the features to consider for a split and do not > count toward the number of informative features. > Splitting on the informative features can occur as many times as > necessary, unless a maximum depth is specified in the constructor. > Note that an informative feature can be re-applied to form a decision > criteria at more than node in the decision tree. > ---------- > > Adjustments welcome. > > Many thanks again! > J.B. > > > > 2017-06-06 2:54 GMT+09:00 Jacob Schreiber : > >> Howdy >> >> When doing bootstrapping, n samples are selected from the dataset WITH >> replacement, where n is the number of samples in the dataset. This leads to >> situations where some samples have a weight > 1 and others have a weight of >> 0. 
This is done separately for each tree. >> >> When selecting the number of features, this should be considered more >> like `max_informative_features.` Essentially, if a tree considers splitting >> a feature that is constant, that won't count against the `max_features` >> threshold that is set. This helps guard against situations where many >> uninformative trees are built because the dataset is full of uninformative >> features. You can see this in the code here: ( >> https://github.com/scikit-learn/scikit-learn/blob/14031f65d >> 144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361). This >> is done on a -per split basis-, meaning that a tree can have more than >> `max_features` number of features considered. >> >> In your example, it is not that there would be at most 20 splits in a >> tree, it is that at each split only 20 informative features would be >> considered. You can split on a feature multiple times (consider the example >> where you have one features and x < 0 is class 0, 0 <= x <= 10 is class 1, >> and x > 10 is class 0 again). >> >> Let me know if you have any other questions! >> >> On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. >> wrote: >> >>> Dear community, >>> >>> This is a question regarding how to interpret the documentation and >>> semantics of the random forest constructors. >>> >>> In forest.py (of version 0.17 which I am still using), the documentation >>> regarding the number of features to consider states on lines 742-745 of the >>> source code that the search may effectively inspect more than >>> `max_features` when determining the features to pick from in order to split >>> a node. >>> It also states that it is tree specific. 
>>> >>> Am I correct in: >>> >>> Interpretation #1 - For bootstrap=True, sampling with replacement occurs >>> for the number of training instances available, meaning that the subsample >>> presented to a particular tree will have some probability of containing >>> overlaps and therefore not the full input training set, but for >>> bootstrap=False, the entire dataset will be presented to each tree? >>> >>> Interpretation #2 - Particularly, with the way I interpret the >>> documentation stating that "The sub-sample size is always the same as the >>> original input sample size...", it seems to me that bootstrap=False then >>> provides the entire training dataset to each decision tree, and it is a >>> matter of which feature was randomly selected first from the features given >>> that determines what the tree will become. >>> That would suggest that, if bootstrap=False, and if the number of trees >>> is high but the feature dimensionality is very low, then there is a high >>> possibility that multiple copies of the same tree will emerge from the >>> forest. >>> >>> Interpretation #3 - the feature subset is not subsampled per tree, but >>> rather all features are presented for the subsampled training data provided >>> to a tree ? For example, if the dimensionality is 400 on a 6000-input >>> training dataset that has randomly been subsampled (with bootstrap=True) to >>> yield 4700 unique training samples, then the tree builder will consider all >>> 400 dimensions/features with respect to the 4700 samples, picking at most >>> `max_features` number of features (out of 400) for building splits in the >>> tree? So by default (sqrt/auto), there would be at most 20 splits in the >>> tree? >>> >>> Confirmations, denials, and corrections to my interpretations are >>> _highly_ welcome. >>> >>> As always, my great thanks to the community. >>> >>> With kind regards, >>> J.B. 
Brown
>>> Kyoto University Graduate School of Medicine
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.pegliasco at makina-corpus.com Mon Jun 12 15:17:26 2017
From: gael.pegliasco at makina-corpus.com (Gaël Pegliasco)
Date: Mon, 12 Jun 2017 21:17:26 +0200
Subject: [scikit-learn] Documentation proposal
Message-ID: 

Hi,

First of all, thanks to all contributors for developing such a rich, simple, well-documented, and easy-to-use machine learning library for Python; it clearly plays a big role in Python's dominance in AI!

As I have been using it more and more these past months, I have written a French tutorial introducing machine learning:

* The Theory (no code here, only describing AI with Python and machine learning concepts with real examples): https://makina-corpus.com/blog/metier/2017/initiation-au-machine-learning-avec-python-theorie
* The Practice (using Scikit-Learn): https://makina-corpus.com/blog/metier/2017/initiation-au-machine-learning-avec-python-pratique

It is another iris tutorial, but with much more detail than most I have read using this dataset, and it covers both supervised and unsupervised learning.

I have received a few positive responses regarding these two articles, as well as requests to translate them into English. If I do translate them into English, you may find it useful to include them in the official Scikit-Learn documentation/examples?
So, if you think it could be useful, I could work on it as soon as next week.

Anyway, any feedback is welcome, especially because I am not an expert and it may not be error-free!

Thanks again for your great work, and keep it up!

Gaël,
--
Makina Corpus
Newsletters | Formations | Twitter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bchbbdhibcpoljao.png
Type: image/png
Size: 6215 bytes
Desc: not available
URL: 

From daphilip at umich.edu Tue Jun 13 15:36:23 2017
From: daphilip at umich.edu (Daniel Harris)
Date: Tue, 13 Jun 2017 15:36:23 -0400
Subject: [scikit-learn] Help with data parsing (link to stack exchange question)
Message-ID: 

Hello,

I hope this is the correct email address for questions regarding support. I posted my question here on Stack Exchange: https://bioinformatics.stackexchange.com/q/702/842

Thank you,
Daniel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com Tue Jun 13 23:01:10 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 14 Jun 2017 13:01:10 +1000
Subject: [scikit-learn] Need contributors to finish open PRs
Message-ID: 

We appear to have a number of pull requests where the original contributors are no longer available to address comments and complete them. Some of them even have complete positive reviews. Many are listed here: https://github.com/scikit-learn/scikit-learn/pulls?utf8=%E2%9C%93&q=is%3Aopen%20is%3Apr%20label%3A%22Need%20Contributor%22%20

Due to their number, it is difficult for core devs to address the comments, as well as to deal with reviewing and managing an upcoming release.
It would be great to have volunteers take them over: check out the PR branch as per https://help.github.com/articles/checking-out-pull-requests-locally/, add commits, push the branch to your fork, and create a pull request noting that you are continuing a previous one and whether you have addressed all comments.

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Wed Jun 14 04:03:34 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 14 Jun 2017 10:03:34 +0200
Subject: [scikit-learn] Sk-Learn Presentation
Message-ID: <20170614080334.GD35516@phare.normalesup.org>

You're welcome to download and reuse the PDF from any of the presentations that I have given: https://www.slideshare.net/GaelVaroquaux/presentations

None of them is an unbiased, comprehensive overview of scikit-learn (I am not sure how much this is possible).

Gaël

On Tue, Jun 13, 2017 at 08:55:15PM -0300, Nykollas Alves wrote:
> I subscribed.
> On 13/06/2017 20:43, wrote:
> This list allows posts by subscribers only. Please subscribe at
> https://mail.python.org/mailman/listinfo/scikit-learn to post to the
> list.
> ---------- Forwarded message ----------
> From: Nykollas Alves
> To: scikit-learn at python.org
> Cc:
> Bcc:
> Date: Tue, 13 Jun 2017 20:43:18 -0300
> Subject: Sk-Learn Presentation
> Good night,
> I'm Nykollas, a student of IS, and I will give a presentation at my
> university about scikit-learn.
> I would like to know if you have a PowerPoint template based on
> scikit-learn that you could provide me.
> If not, I understand!
> Thanks in advance.
> Sorry for my English; I am not a native speaker.
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From jmschreiber91 at gmail.com Wed Jun 14 12:04:05 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Wed, 14 Jun 2017 09:04:05 -0700
Subject: [scikit-learn] Documentation proposal
In-Reply-To: References: Message-ID: 

Hi Gael

Thanks for the work! We are grateful for the work that other people do in providing these types of tutorials and introductions, as they lower the barrier of entry for new people to get into machine learning. We generally don't include these in the official sklearn documentation, in no small part because it would be a time sink to decide which among a large group of tutorials should be included. That being said, perhaps we should consider having a 'related tutorials' page, similar to the 'related work' page, serving as an aggregation of links?

Jacob

On Mon, Jun 12, 2017 at 12:17 PM, Gaël Pegliasco via scikit-learn <scikit-learn at python.org> wrote:

> Hi,
>
> First of all, thanks to all contributors for developping a such rich,
> simple, well documented and easy to use machine learning library for Python
> ; which, clearly, plays a big role in Python world domination in AI !
> > As I'm using it more and more these past month, I've written a french > tutorial on machine learning introduction: > > - The Theory (no code here, only describing AI with Python and machine > learning concepts with real examples): > https://makina-corpus.com/blog/metier/2017/initiation- > au-machine-learning-avec-python-theorie > > - The Practice (using Scikit-Learn) > https://makina-corpus.com/blog/metier/2017/initiation- > au-machine-learning-avec-python-pratique > > Another iris tutorial, but with much more details than most I've read > using this database and using both supervised and unsupervised learning > > I've received a few positive returns regarding these 2 articles and others > requests to translate it into english. > > I think that as to translate it into english, you may find it useful to > include it into Scikit-Learn official documentation/examples ? > > So, if you think it can be useful I could work on it as soon as next week. > > Anyway, any feedback is welcome, especially because I'm not an expert and > that it may not be error safe! > > Thanks again for your great work and keep going on ! > > Ga?l, > -- > [image: Makina Corpus] > Newsletters | > Formations | Twitter > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bchbbdhibcpoljao.png Type: image/png Size: 6215 bytes Desc: not available URL: From jmschreiber91 at gmail.com Wed Jun 14 12:15:08 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 14 Jun 2017 09:15:08 -0700 Subject: [scikit-learn] Help with data parsing (link to stack exchange question) In-Reply-To: References: Message-ID: It's unclear to me what exactly you want to do with the classification algorithm. 
Is your goal to take in a binary data matrix indicating the presence of certain k-mers and predict whether the present k-mers indicate a susceptible or resistant genome? If so, then you need to convert your sequence into this binary matrix (or possibly a count matrix, if you think counts are more important) such that each row indicates a genome and each column corresponds to a k-mer. I don't think scikit-learn has any built-in tools for turning a string into a k-mer encoding (possible future PR?), so you'd have to do this manually.

Let me know if that answered your question.

On Tue, Jun 13, 2017 at 12:36 PM, Daniel Harris wrote:

> Hello,
>
> I hope this is the correct email address for questions regarding support.
> I posted my question here on stack exchange:
> https://bioinformatics.stackexchange.com/q/702/842
>
> Thank you,
> Daniel
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zephyr14 at gmail.com Wed Jun 14 16:43:49 2017
From: zephyr14 at gmail.com (Vlad Niculae)
Date: Wed, 14 Jun 2017 22:43:49 +0200
Subject: [scikit-learn] Documentation proposal
In-Reply-To: References: Message-ID: 

Indeed, thank you, Gael! My 2c, not thought through very thoroughly, is that although a "related tutorials" page would be great, it would be considerably more of a maintenance burden than scikit-learn-contrib, because docs go stale faster than code. We *could* force all code in the doc to be runnable and unit-tested, but that is probably not sufficient, because checking the text cannot really be done automatically. It would be great if we could figure out a system to enable community maintenance of related docs & tutorials without letting them go out of date; I think that's something we can think about.
Yours, Vlad On Wed, Jun 14, 2017 at 6:04 PM, Jacob Schreiber wrote: > Hi Gael > > Thanks for the work! We are grateful for the work that other people do in > providing these types of tutorials and introductions as they lower the > barrier of entry for new people to get into machine learning. We generally > don't include these in the official sklearn documentation, in no small part > because it would be a time sink to decide from which among a large group of > tutorials should be included. That being said, perhaps we should consider > having a 'related tutorials' page similar to the 'related work' page, > serving as an aggregation of links? > > Jacob > > On Mon, Jun 12, 2017 at 12:17 PM, Ga?l Pegliasco via scikit-learn < > scikit-learn at python.org> wrote: > >> Hi, >> >> First of all, thanks to all contributors for developping a such rich, >> simple, well documented and easy to use machine learning library for Python >> ; which, clearly, plays a big role in Python world domination in AI ! >> >> As I'm using it more and more these past month, I've written a french >> tutorial on machine learning introduction: >> >> - The Theory (no code here, only describing AI with Python and >> machine learning concepts with real examples): >> https://makina-corpus.com/blog/metier/2017/initiation-au- >> machine-learning-avec-python-theorie >> >> - The Practice (using Scikit-Learn) >> https://makina-corpus.com/blog/metier/2017/initiation-au- >> machine-learning-avec-python-pratique >> >> Another iris tutorial, but with much more details than most I've read >> using this database and using both supervised and unsupervised learning >> >> I've received a few positive returns regarding these 2 articles and >> others requests to translate it into english. >> >> I think that as to translate it into english, you may find it useful to >> include it into Scikit-Learn official documentation/examples ? >> >> So, if you think it can be useful I could work on it as soon as next week. 
>>
>> Anyway, any feedback is welcome, especially because I'm not an expert
>> and that it may not be error safe!
>>
>> Thanks again for your great work and keep going on !
>>
>> Gaël,
>> --
>> [image: Makina Corpus]
>> Newsletters |
>> Formations | Twitter
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bchbbdhibcpoljao.png
Type: image/png
Size: 6215 bytes
Desc: not available
URL: 

From roy at cerfacs.fr Thu Jun 15 05:11:02 2017
From: roy at cerfacs.fr (roy)
Date: Thu, 15 Jun 2017 11:11:02 +0200
Subject: [scikit-learn] Sampling capabilities and uncertainty quantification
Message-ID: <1D1A7ABA-92DC-418E-83A8-7FC10E3C4982@cerfacs.fr>

Hi,

I have not found a sampling class in the docs. To train a Gaussian process, for example, it is preferable to use sampling techniques (low-discrepancy sequences, uniform designs, etc.). Do you plan to add such a thing?

Also, I was wondering if you were planning to add variance analysis (Sobol indices). You already have some PCA, so it would be complementary.

I have some Python code for both topics if needed.

Sincerely,
Pamphile

From gael.varoquaux at normalesup.org Thu Jun 15 08:16:12 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 15 Jun 2017 14:16:12 +0200
Subject: [scikit-learn] Sampling capabilities and uncertainty quantification
In-Reply-To: <1D1A7ABA-92DC-418E-83A8-7FC10E3C4982@cerfacs.fr>
References: <1D1A7ABA-92DC-418E-83A8-7FC10E3C4982@cerfacs.fr>
Message-ID: <20170615121612.GB919691@phare.normalesup.org>

We don't do sampling in scikit-learn.
Sampling is a whole set of techniques in itself. I would advise you to have a look at PyMC.

Cheers,
Gaël

On Thu, Jun 15, 2017 at 11:11:02AM +0200, roy wrote:
> Hi,
> From the doc, I have not found a sampling class.
> To train a gaussian process for example, It is preferable to use sampling techniques (low discrepancy sequences, uniform designs, etc.).
> Do you plan to do such a thing?
> Also, I was wondering if you were planning to add variance analysis (Sobol indices). You already have some PCA so it would be complementary.
> I have some python codes for both topics if needed.
> Sincerely,
> Pamphile
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From Akash.Devgun at colorado.edu Thu Jun 15 19:13:44 2017
From: Akash.Devgun at colorado.edu (Akash Devgun)
Date: Thu, 15 Jun 2017 16:13:44 -0700
Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R
Message-ID: 

Please let me know: do you have a random forest imputation model in scikit-learn similar to R's rfImpute?

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jmschreiber91 at gmail.com Thu Jun 15 20:14:50 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Thu, 15 Jun 2017 17:14:50 -0700
Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R
In-Reply-To: References: Message-ID: 

No.

On Thu, Jun 15, 2017 at 4:13 PM, Akash Devgun wrote:

> Please let me know: do you have a random forest imputation model in
> scikit-learn similar to R's rfImpute?
> > Thanks
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Akash.Devgun at colorado.edu Thu Jun 15 20:26:35 2017
From: Akash.Devgun at colorado.edu (Akash Devgun)
Date: Fri, 16 Jun 2017 00:26:35 +0000
Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R
In-Reply-To: References: Message-ID: 

Will you have it in the future?

On Thu, Jun 15, 2017 at 5:14 PM Jacob Schreiber wrote:

> No.
>
> On Thu, Jun 15, 2017 at 4:13 PM, Akash Devgun wrote:
>
>> Please let me know .... Do you have random Forest Imputation model in
>> python-scikit learn similar to rfImpute in R has ?
>>
>> Thanks
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jmschreiber91 at gmail.com Thu Jun 15 20:31:28 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Thu, 15 Jun 2017 17:31:28 -0700
Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R
In-Reply-To: References: Message-ID: 

Most likely not. If there is a willing contributor, we would be happy to review a PR though.

On Thu, Jun 15, 2017 at 5:26 PM, Akash Devgun wrote:

> Will you have it in the future?
>
> On Thu, Jun 15, 2017 at 5:14 PM Jacob Schreiber wrote:
>
>> No.
>>
>> On Thu, Jun 15, 2017 at 4:13 PM, Akash Devgun wrote:
>>
>>> Please let me know .... Do you have random Forest Imputation model in
>>> python-scikit learn similar to rfImpute in R has ?
>>> >>> Thanks >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 15 22:34:05 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 16 Jun 2017 12:34:05 +1000 Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R In-Reply-To: References: Message-ID: Hi Akash, the fancyimpute package (https://pypi.python.org/pypi/fancyimpute) may be of interest. It doesn't implement exactly this, but MICE may be a similar enough technique to give good results. A main difference appears to be that random forest imputation has the notion of proximity weighting, rather than just using a regressor to predict as usual. On 16 June 2017 at 10:31, Jacob Schreiber wrote: > Most likely not. If there is a willing contributor, we would be happy to > review a PR though. > > On Thu, Jun 15, 2017 at 5:26 PM, Akash Devgun > wrote: > >> Will you have in future?? >> >> On Thu, Jun 15, 2017 at 5:14 PM Jacob Schreiber >> wrote: >> >>> No. >>> >>> On Thu, Jun 15, 2017 at 4:13 PM, Akash Devgun >> > wrote: >>> >>>> Please let me know .... Do you have random Forest Imputation model in >>>> python-scikit learn similar to rfImpute in R has ? >>>> >>>> Thanks >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
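Until something like rfImpute lands in scikit-learn, the idea behind it — repeatedly predicting the missing entries of each column from the other columns — can be sketched with plain NumPy. This is only a simplified illustration: a least-squares regressor stands in for the random forest, and there is no proximity weighting; the function name `iterative_impute` is made up for this sketch.

```python
import numpy as np

def iterative_impute(X, n_iter=5):
    """Fill NaNs by iteratively regressing each incomplete column
    on the remaining columns (least squares in place of a forest)."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Start from a mean-imputed copy, as rfImpute does.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(X_filled, j, axis=1)
            # Add an intercept column and fit on the observed rows only.
            A = np.column_stack([np.ones(len(X_filled)), others])
            coef, *_ = np.linalg.lstsq(A[~rows], X_filled[~rows, j], rcond=None)
            # Re-predict the missing entries from the current fit.
            X_filled[rows, j] = A[rows] @ coef
    return X_filled

X = [[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]]
out = iterative_impute(X)
print(out)  # the NaN is replaced by ~6.0 (the columns follow y = 2x)
```

A real implementation would also track convergence between iterations and use an estimator that captures non-linear structure, which is exactly where the forest earns its keep.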
URL: From t3kcit at gmail.com Fri Jun 16 17:35:56 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 16 Jun 2017 17:35:56 -0400 Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R In-Reply-To: References: Message-ID: <043092d0-a43e-69a1-364b-fa194242facf@gmail.com> Why not? I thought we wanted to add estimator-based imputation. The problem with fancyimpute is that it has no notion of test set, so you can't apply it to new data. Cheers, Andy On 06/15/2017 08:31 PM, Jacob Schreiber wrote: > Most likely not. If there is a willing contributor, we would be happy > to review a PR though. > > On Thu, Jun 15, 2017 at 5:26 PM, Akash Devgun > > wrote: > > Will you have in future?? > > On Thu, Jun 15, 2017 at 5:14 PM Jacob Schreiber > > wrote: > > No. > > On Thu, Jun 15, 2017 at 4:13 PM, Akash Devgun > > > wrote: > > Please let me know .... Do you have random Forest > Imputation model in python-scikit learn similar to > rfImpute in R has ? > > Thanks > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Jun 16 17:39:48 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 16 Jun 2017 17:39:48 -0400 Subject: [scikit-learn] Documentation proposal In-Reply-To: References: Message-ID: <5b7d8fb0-e408-95af-6729-ae831e046ba4@gmail.com> I'm pretty sure there's a new scikit-learn blog-post about every day, with highly varying quality. I don't think it's a good idea to spend our time reviewing them. On 06/14/2017 04:43 PM, Vlad Niculae wrote: > Indeed, thank you, Gael! 
> > My 2c, not thought through very thoroughly, is that although a > "related tutorials" would be great, it would be considerably more of a > maintenance burden than scikit-learn-contrib, because docs go staler > faster than code. We *could* force all code in the doc to be runnable > and unit-tested, but that is probably not sufficient, because checking > the text cannot really be done automatically. It would be great if we > could figure out a system to enable community maintenance of related > docs & tutorial without letting them go out of date, I think that's > something we can think about. > > Yours, > Vlad > > On Wed, Jun 14, 2017 at 6:04 PM, Jacob Schreiber > > wrote: > > Hi Gael > > Thanks for the work! We are grateful for the work that other > people do in providing these types of tutorials and introductions > as they lower the barrier of entry for new people to get into > machine learning. We generally don't include these in the official > sklearn documentation, in no small part because it would be a time > sink to decide from which among a large group of tutorials should > be included. That being said, perhaps we should consider having a > 'related tutorials' page similar to the 'related work' page, > serving as an aggregation of links? > > Jacob > > On Mon, Jun 12, 2017 at 12:17 PM, Gaël Pegliasco via scikit-learn > > wrote: > > Hi, > > First of all, thanks to all contributors for developping a > such rich, simple, well documented and easy to use machine > learning library for Python ; which, clearly, plays a big role > in Python world domination in AI !
> > As I'm using it more and more these past month, I've written a > french tutorial on machine learning introduction: > > * The Theory (no code here, only describing AI with Python > and machine learning concepts with real examples): > https://makina-corpus.com/blog/metier/2017/initiation-au-machine-learning-avec-python-theorie > > * The Practice (using Scikit-Learn) > https://makina-corpus.com/blog/metier/2017/initiation-au-machine-learning-avec-python-pratique > > Another iris tutorial, but with much more details than > most I've read using this database and using both > supervised and unsupervised learning > > I've received a few positive returns regarding these 2 > articles and others requests to translate it into english. > > I think that as to translate it into english, you may find it > useful to include it into Scikit-Learn official > documentation/examples ? > > So, if you think it can be useful I could work on it as soon > as next week. > > Anyway, any feedback is welcome, especially because I'm not an > expert and that it may not be error safe! > > Thanks again for your great work and keep going on ! > > Gaël, > > -- > Makina Corpus > Newsletters > | Formations | Twitter > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: bchbbdhibcpoljao.png Type: image/png Size: 6215 bytes Desc: not available URL: From jmschreiber91 at gmail.com Sun Jun 18 03:07:31 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sun, 18 Jun 2017 07:07:31 +0000 Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R In-Reply-To: <043092d0-a43e-69a1-364b-fa194242facf@gmail.com> References: <043092d0-a43e-69a1-364b-fa194242facf@gmail.com> Message-ID: I misspoke. I didn't mean that there is a reason not to support it, just that there are no current plans to support it and that we would welcome a willing contributor to get it rolling. On Fri, Jun 16, 2017 at 2:36 PM Andreas Mueller wrote: > Why not? > I thought we wanted to add estimator-based imputation. > The problem with fancyimpute is that it has no notion of test set, so you > can't apply it to new data. > > Cheers, > Andy > > > > On 06/15/2017 08:31 PM, Jacob Schreiber wrote: > > Most likely not. If there is a willing contributor, we would be happy to > review a PR though. > > On Thu, Jun 15, 2017 at 5:26 PM, Akash Devgun > wrote: > >> Will you have in future?? >> >> On Thu, Jun 15, 2017 at 5:14 PM Jacob Schreiber >> wrote: >> >>> No. >>> >>> On Thu, Jun 15, 2017 at 4:13 PM, Akash Devgun >> > wrote: >>> >>>> Please let me know .... Do you have random Forest Imputation model in >>>> python-scikit learn similar to rfImpute in R has ? 
>>>> >>>> Thanks >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Sun Jun 18 12:02:06 2017 From: tmrsg11 at gmail.com (C W) Date: Sun, 18 Jun 2017 12:02:06 -0400 Subject: [scikit-learn] R user trying to learn Python Message-ID: Dear Scikit-learn, What are some good ways and resources to learn Python for data analysis? I am extremely frustrated using this thing. Everything comes after a dot! Why would you type the same thing at the beginning of every line? It's not efficient. code 1: y_sin = np.sin(x) y_cos = np.cos(x) I know you can import the entire package without the "as np", but I see np.something as the standard. Why? Code 2: model = LogisticRegression() model.fit(X_train, y_train) model.score(X_test, y_test) In R, everything is saved to a variable. In the code above, what if I accidentally ran model.fit(), I would not know. Code 3: from sklearn import linear_model reg = linear_model.Ridge (alpha = .5) reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layers; how deep do I have to dig in? Can someone explain the mentality behind this setup? Thank you very much! M -------------- next part -------------- An HTML attachment was scrubbed...
URL: From se.raschka at gmail.com Sun Jun 18 12:53:03 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 18 Jun 2017 12:53:03 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: Message-ID: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Hi, > I am extremely frustrated using this thing. Everything comes after a dot! Why would you type the sam thing at the beginning of every line. It's not efficient. > > code 1: > y_sin = np.sin(x) > y_cos = np.cos(x) > > I know you can import the entire package without the "as np", but I see np.something as the standard. Why? Because it makes it clear where this function is coming from. Sure, you could do from numpy import * but this is NOT!!! recommended. The reason why this is not recommended is that it would clutter up your main namespace. For instance, numpy has its own sum function. If you do from numpy import *, Python's in-built `sum` will be gone from your main namespace and replaced by NumPy's sum. This is confusing and should be avoided. > In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layer, how deep do I have to dig in? > > Can someone explain the mentality behind this setup? This is one way to organize your code and package. Sklearn contains many things, and organizing it by subpackages (linear_model, svm, ...) only makes sense; otherwise, you would end up with code files > 100,000 lines or so, which would make life really hard for package developers. Here, scikit-learn tries to follow the core principles of good object-oriented program design, for instance, abstraction, encapsulation, modularity, hierarchy, ... > What are some good ways and resources to learn Python for data analysis? I think based on your questions, a good resource would be an introduction to programming book or course.
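Sebastian's warning about `from numpy import *` can be seen directly in a few lines. A minimal sketch of why the `np.` prefix matters — the built-in `sum` and NumPy's `sum` behave differently, so shadowing one with the other changes existing code:

```python
import numpy as np  # the conventional alias keeps NumPy's names in their own namespace

data = [[1, 2], [3, 4]]

# With the "np." prefix it is always unambiguous which sum you are calling:
print(sum([1, 2, 3]))        # -> 6, Python's built-in sum
print(sum(data, []))         # -> [1, 2, 3, 4], built-in sum concatenates lists
print(np.sum(data))          # -> 10, NumPy's sum adds every element
print(np.sum(data, axis=0))  # -> [4 6], column-wise sums

# After `from numpy import *`, the bare name `sum` would silently refer to
# np.sum instead of the built-in, changing the behavior of existing code.
```

The same reasoning applies to `min`, `max`, `abs`, and `all`, which NumPy also redefines.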
I think that sections on object-oriented programming would make the rationale/design/API of scikit-learn and Python classes as a whole more accessible and address your concerns and questions. Best, Sebastian > On Jun 18, 2017, at 12:02 PM, C W wrote: > > Dear Scikit-learn, > > What are some good ways and resources to learn Python for data analysis? > > I am extremely frustrated using this thing. Everything comes after a dot! Why would you type the sam thing at the beginning of every line. It's not efficient. > > code 1: > y_sin = np.sin(x) > y_cos = np.cos(x) > > I know you can import the entire package without the "as np", but I see np.something as the standard. Why? > > Code 2: > model = LogisticRegression() > model.fit(X_train, y_train) > model.score(X_test, y_test) > > In R, everything is saved to a variable. In the code above, what if I accidentally ran model.fit(), I would not know. > > Code 3: > from sklearn import linear_model > reg = linear_model.Ridge (alpha = .5) > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layer, how deep do I have to dig in? > > Can someone explain the mentality behind this setup? > > Thank you very much! > > M > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tmrsg11 at gmail.com Sun Jun 18 16:18:37 2017 From: tmrsg11 at gmail.com (C W) Date: Sun, 18 Jun 2017 16:18:37 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> References: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Message-ID: Hi Sebastian, I looked through your book. I think it is great if you already know Python and are looking to learn machine learning. For me, I have some sense of machine learning, but none of Python.
Unlike R, which is specifically for statistical analysis, Python is broad! Maybe some expert here with R can tell me how to go about this. :) On Sun, Jun 18, 2017 at 12:53 PM, Sebastian Raschka wrote: > Hi, > > > I am extremely frustrated using this thing. Everything comes after a > dot! Why would you type the sam thing at the beginning of every line. It's > not efficient. > > > > code 1: > > y_sin = np.sin(x) > > y_cos = np.cos(x) > > > > I know you can import the entire package without the "as np", but I see > np.something as the standard. Why? > > Because it makes it clear where this function is coming from. Sure, you > could do > > from numpy import * > > but this is NOT!!! recommended. The reason why this is not recommended is > that it would clutter up your main name space. For instance, numpy has its > own sum function. If you do from numpy import *, Python's in-built `sum` > will be gone from your main name space and replaced by NumPy's sum. This is > confusing and should be avoided. > > > In the code above, sklearn > linear_model > Ridge, one lives inside the > other, it feels that there are multiple layer, how deep do I have to dig in? > > > > Can someone explain the mentality behind this setup? > > This is one way to organize your code and package. Sklearn contains many > things, and organizing it by subpackages (linear_model, svm, ...) makes > only sense; otherwise, you would end up with code files > 100,000 lines or > so, which would make life really hard for package developers. > > Here, scikit-learn tries to follow the core principles of good object > oriented program design, for instance, Abstraction, encapsulation, > modularity, hierarchy, ... > > > What are some good ways and resources to learn Python for data analysis? > > I think baed on your questions, a good resource would be an introduction > to programming book or course.
I think that sections on objected oriented > programming would make the rationale/design/API of scikit-learn and Python > classes as a whole more accessible and address your concerns and questions. > > Best, > Sebastian > > > On Jun 18, 2017, at 12:02 PM, C W wrote: > > > > Dear Scikit-learn, > > > > What are some good ways and resources to learn Python for data analysis? > > > > I am extremely frustrated using this thing. Everything comes after a > dot! Why would you type the sam thing at the beginning of every line. It's > not efficient. > > > > code 1: > > y_sin = np.sin(x) > > y_cos = np.cos(x) > > > > I know you can import the entire package without the "as np", but I see > np.something as the standard. Why? > > > > Code 2: > > model = LogisticRegression() > > model.fit(X_train, y_train) > > model.score(X_test, y_test) > > > > In R, everything is saved to a variable. In the code above, what if I > accidentally ran model.fit(), I would not know. > > > > Code 3: > > from sklearn import linear_model > > reg = linear_model.Ridge (alpha = .5) > > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > > > In the code above, sklearn > linear_model > Ridge, one lives inside the > other, it feels that there are multiple layer, how deep do I have to dig in? > > > > Can someone explain the mentality behind this setup? > > > > Thank you very much! > > > > M > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sean.violante at gmail.com Sun Jun 18 16:34:10 2017 From: sean.violante at gmail.com (Sean Violante) Date: Sun, 18 Jun 2017 22:34:10 +0200 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Message-ID: CW, you might want to read http://greenteapress.com/wp/think-python/ (available as free pdf) (for basics of programming and python) and Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (O'Reilly) (for data analysis libraries: pandas, numpy, ipython...) On Sun, Jun 18, 2017 at 10:18 PM, C W wrote: > Hi Sebastian, > > I looked through your book. I think it is great if you already know > Python, and looking to learn machine learning. > > For me, I have some sense of machine learning, but none of Python. > > Unlike R, which is specifically for statistics analysis. Python is broad! > > Maybe some expert here with R can tell me how to go about this. :) > > On Sun, Jun 18, 2017 at 12:53 PM, Sebastian Raschka > wrote: >> >> Hi, >> >> > I am extremely frustrated using this thing. Everything comes after a >> dot! Why would you type the sam thing at the beginning of every line. It's >> not efficient. >> > >> > code 1: >> > y_sin = np.sin(x) >> > y_cos = np.cos(x) >> > >> > I know you can import the entire package without the "as np", but I see >> np.something as the standard. Why? >> >> Because it makes it clear where this function is coming from. Sure, you >> could do >> >> from numpy import * >> >> but this is NOT!!! recommended. The reason why this is not recommended is >> that it would clutter up your main name space. For instance, numpy has its >> own sum function. If you do from numpy import *, Python's in-built `sum` >> will be gone from your main name space and replaced by NumPy's sum. This is >> confusing and should be avoided.
>> >> > In the code above, sklearn > linear_model > Ridge, one lives inside the >> other, it feels that there are multiple layer, how deep do I have to dig in? >> > >> > Can someone explain the mentality behind this setup? >> >> This is one way to organize your code and package. Sklearn contains many >> things, and organizing it by subpackages (linear_model, svm, ...) makes >> only sense; otherwise, you would end up with code files > 100,000 lines or >> so, which would make life really hard for package developers. >> >> Here, scikit-learn tries to follow the core principles of good object >> oriented program design, for instance, Abstraction, encapsulation, >> modularity, hierarchy, ... >> >> > What are some good ways and resources to learn Python for data analysis? >> >> I think baed on your questions, a good resource would be an introduction >> to programming book or course. I think that sections on objected oriented >> programming would make the rationale/design/API of scikit-learn and Python >> classes as a whole more accessible and address your concerns and questions. >> >> Best, >> Sebastian >> >> > On Jun 18, 2017, at 12:02 PM, C W wrote: >> > >> > Dear Scikit-learn, >> > >> > What are some good ways and resources to learn Python for data analysis? >> > >> > I am extremely frustrated using this thing. Everything comes after a >> dot! Why would you type the sam thing at the beginning of every line. It's >> not efficient. >> > >> > code 1: >> > y_sin = np.sin(x) >> > y_cos = np.cos(x) >> > >> > I know you can import the entire package without the "as np", but I see >> np.something as the standard. Why? >> > >> > Code 2: >> > model = LogisticRegression() >> > model.fit(X_train, y_train) >> > model.score(X_test, y_test) >> > >> > In R, everything is saved to a variable. In the code above, what if I >> accidentally ran model.fit(), I would not know. 
>> > >> > Code 3: >> > from sklearn import linear_model >> > reg = linear_model.Ridge (alpha = .5) >> > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) >> > >> > In the code above, sklearn > linear_model > Ridge, one lives inside the >> other, it feels that there are multiple layer, how deep do I have to dig in? >> > >> > Can someone explain the mentality behind this setup? >> > >> > Thank you very much! >> > >> > M >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Jun 18 16:27:37 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 18 Jun 2017 16:27:37 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Message-ID: <713D2FA0-76EA-4DF9-B29A-4499C4E70B36@gmail.com> Hi, C W, yeah I'd say that Python is a programming language with lots of packages for scientific computing, whereas R is more of a toolbox for stats. Thus, Python may be a bit weird at first for people who come from the R/stats field and are new to programming. Not sure if it is necessary to learn programming & computer science basics for a person who is primarily interested in in stats and ML, but since so many tools are Python-based and require some sort of basic programming to fit the pieces together, it's maybe not a bad idea :). There's probably an over-abundance of python intro books out there ... 
However, I'd maybe recommend a introduction to computer science book that uses Python as a teaching language rather than a book that is just about Python language. Maybe check out https://www.udacity.com/course/intro-to-computer-science--cs101, which is a Python-based computer science course (and should be free). Best, Sebastian > On Jun 18, 2017, at 4:18 PM, C W wrote: > > Hi Sebastian, > > I looked through your book. I think it is great if you already know Python, and looking to learn machine learning. > > For me, I have some sense of machine learning, but none of Python. > > Unlike R, which is specifically for statistics analysis. Python is broad! > > Maybe some expert here with R can tell me how to go about this. :) > > On Sun, Jun 18, 2017 at 12:53 PM, Sebastian Raschka wrote: > Hi, > > > I am extremely frustrated using this thing. Everything comes after a dot! Why would you type the sam thing at the beginning of every line. It's not efficient. > > > > code 1: > > y_sin = np.sin(x) > > y_cos = np.cos(x) > > > > I know you can import the entire package without the "as np", but I see np.something as the standard. Why? > > Because it makes it clear where this function is coming from. Sure, you could do > > from numpy import * > > but this is NOT!!! recommended. The reason why this is not recommended is that it would clutter up your main name space. For instance, numpy has its own sum function. If you do from numpy import *, Python's in-built `sum` will be gone from your main name space and replaced by NumPy's sum. This is confusing and should be avoided. > > > In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layer, how deep do I have to dig in? > > > > Can someone explain the mentality behind this setup? > > This is one way to organize your code and package. Sklearn contains many things, and organizing it by subpackages (linear_model, svm, ...) 
makes only sense; otherwise, you would end up with code files > 100,000 lines or so, which would make life really hard for package developers. > > Here, scikit-learn tries to follow the core principles of good object oriented program design, for instance, Abstraction, encapsulation, modularity, hierarchy, ... > > > What are some good ways and resources to learn Python for data analysis? > > I think baed on your questions, a good resource would be an introduction to programming book or course. I think that sections on objected oriented programming would make the rationale/design/API of scikit-learn and Python classes as a whole more accessible and address your concerns and questions. > > Best, > Sebastian > > > On Jun 18, 2017, at 12:02 PM, C W wrote: > > > > Dear Scikit-learn, > > > > What are some good ways and resources to learn Python for data analysis? > > > > I am extremely frustrated using this thing. Everything comes after a dot! Why would you type the sam thing at the beginning of every line. It's not efficient. > > > > code 1: > > y_sin = np.sin(x) > > y_cos = np.cos(x) > > > > I know you can import the entire package without the "as np", but I see np.something as the standard. Why? > > > > Code 2: > > model = LogisticRegression() > > model.fit(X_train, y_train) > > model.score(X_test, y_test) > > > > In R, everything is saved to a variable. In the code above, what if I accidentally ran model.fit(), I would not know. > > > > Code 3: > > from sklearn import linear_model > > reg = linear_model.Ridge (alpha = .5) > > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > > > In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layer, how deep do I have to dig in? > > > > Can someone explain the mentality behind this setup? > > > > Thank you very much! 
> > > > M > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From nelle.varoquaux at gmail.com Sun Jun 18 16:37:30 2017 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Sun, 18 Jun 2017 13:37:30 -0700 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Message-ID: Hello, The concepts behind R and Python are entirely different. Python is meant to be as explicit as possible, and uses the concept of namespaces, which R doesn't. While it can seem that Python code is more verbose, it is very clear when reading Python code which functions come from which module and submodule (this is linked to your code 1 and code 3 examples). For example 2, R indeed saves everything to a variable, while Python does not. The advantage is that Python is much more time and memory efficient than R. The tradeoff is that you do not keep intermediate results. Hope that explains, N On 18 June 2017 at 13:18, C W wrote: > Hi Sebastian, > > I looked through your book. I think it is great if you already know Python, > and looking to learn machine learning. > > For me, I have some sense of machine learning, but none of Python. > > Unlike R, which is specifically for statistics analysis. Python is broad! > > Maybe some expert here with R can tell me how to go about this. :) > > On Sun, Jun 18, 2017 at 12:53 PM, Sebastian Raschka > wrote: >> >> Hi, >> >> > I am extremely frustrated using this thing. Everything comes after a >> > dot!
Why would you type the sam thing at the beginning of every line. It's >> > not efficient. >> > >> > code 1: >> > y_sin = np.sin(x) >> > y_cos = np.cos(x) >> > >> > I know you can import the entire package without the "as np", but I see >> > np.something as the standard. Why? >> >> Because it makes it clear where this function is coming from. Sure, you >> could do >> >> from numpy import * >> >> but this is NOT!!! recommended. The reason why this is not recommended is >> that it would clutter up your main name space. For instance, numpy has its >> own sum function. If you do from numpy import *, Python's in-built `sum` >> will be gone from your main name space and replaced by NumPy's sum. This is >> confusing and should be avoided. >> >> > In the code above, sklearn > linear_model > Ridge, one lives inside the >> > other, it feels that there are multiple layer, how deep do I have to dig in? >> > >> > Can someone explain the mentality behind this setup? >> >> This is one way to organize your code and package. Sklearn contains many >> things, and organizing it by subpackages (linear_model, svm, ...) makes only >> sense; otherwise, you would end up with code files > 100,000 lines or so, >> which would make life really hard for package developers. >> >> Here, scikit-learn tries to follow the core principles of good object >> oriented program design, for instance, Abstraction, encapsulation, >> modularity, hierarchy, ... >> >> > What are some good ways and resources to learn Python for data analysis? >> >> I think baed on your questions, a good resource would be an introduction >> to programming book or course. I think that sections on objected oriented >> programming would make the rationale/design/API of scikit-learn and Python >> classes as a whole more accessible and address your concerns and questions. 
>> >> Best, >> Sebastian >> >> > On Jun 18, 2017, at 12:02 PM, C W wrote: >> > >> > Dear Scikit-learn, >> > >> > What are some good ways and resources to learn Python for data analysis? >> > >> > I am extremely frustrated using this thing. Everything comes after a >> > dot! Why would you type the sam thing at the beginning of every line. It's >> > not efficient. >> > >> > code 1: >> > y_sin = np.sin(x) >> > y_cos = np.cos(x) >> > >> > I know you can import the entire package without the "as np", but I see >> > np.something as the standard. Why? >> > >> > Code 2: >> > model = LogisticRegression() >> > model.fit(X_train, y_train) >> > model.score(X_test, y_test) >> > >> > In R, everything is saved to a variable. In the code above, what if I >> > accidentally ran model.fit(), I would not know. >> > >> > Code 3: >> > from sklearn import linear_model >> > reg = linear_model.Ridge (alpha = .5) >> > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) >> > >> > In the code above, sklearn > linear_model > Ridge, one lives inside the >> > other, it feels that there are multiple layer, how deep do I have to dig in? >> > >> > Can someone explain the mentality behind this setup? >> > >> > Thank you very much! 
>> > >> > M >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From tmrsg11 at gmail.com Sun Jun 18 17:32:31 2017 From: tmrsg11 at gmail.com (C W) Date: Sun, 18 Jun 2017 17:32:31 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Message-ID: Thank you all for the love! Sean, I think your recommendation is perfect! It covers everything: very concise, to the point. Sebastian, I will certainly invest time into that course when I have time. Nelle, I agree! And from what I read, the head(), tail(), and data.frame() in Python actually came from R by request. Hence, I came to think they are similar. For anyone else in the world reading, I think the pandas doc is also good: http://pandas.pydata.org/pandas-docs/stable/pandas.pdf Mike On Sun, Jun 18, 2017 at 4:37 PM, Nelle Varoquaux wrote: > Hello, > > The concepts behind R and Python are entirely different. Python is > meant to be as explicit as possible, and uses the concept of > namespaces, which R doesn't. > While it can seem that Python code is more verbose, it is very clear > when reading Python code which functions come from which module and > submodule (this is linked to your code 1 and code 3 examples). > > For example 2, R indeed saves everything to a variable, while Python > does not. The advantage is that Python is much more time and memory > efficient than R. The tradeoff is that you do not keep intermediate > results.
> > Hope that explains, > N > > On 18 June 2017 at 13:18, C W wrote: > > Hi Sebastian, > > > > I looked through your book. I think it is great if you already know > Python, > > and looking to learn machine learning. > > > > For me, I have some sense of machine learning, but none of Python. > > > > Unlike R, which is specifically for statistics analysis. Python is broad! > > > > Maybe some expert here with R can tell me how to go about this. :) > > > > On Sun, Jun 18, 2017 at 12:53 PM, Sebastian Raschka < > se.raschka at gmail.com> > > wrote: > >> > >> Hi, > >> > >> > I am extremely frustrated using this thing. Everything comes after a > >> > dot! Why would you type the sam thing at the beginning of every line. > It's > >> > not efficient. > >> > > >> > code 1: > >> > y_sin = np.sin(x) > >> > y_cos = np.cos(x) > >> > > >> > I know you can import the entire package without the "as np", but I > see > >> > np.something as the standard. Why? > >> > >> Because it makes it clear where this function is coming from. Sure, you > >> could do > >> > >> from numpy import * > >> > >> but this is NOT!!! recommended. The reason why this is not recommended > is > >> that it would clutter up your main name space. For instance, numpy has > its > >> own sum function. If you do from numpy import *, Python's in-built `sum` > >> will be gone from your main name space and replaced by NumPy's sum. > This is > >> confusing and should be avoided. > >> > >> > In the code above, sklearn > linear_model > Ridge, one lives inside > the > >> > other, it feels that there are multiple layer, how deep do I have to > dig in? > >> > > >> > Can someone explain the mentality behind this setup? > >> > >> This is one way to organize your code and package. Sklearn contains many > >> things, and organizing it by subpackages (linear_model, svm, ...) makes > only > >> sense; otherwise, you would end up with code files > 100,000 lines or > so, > >> which would make life really hard for package developers. 
> >> > >> Here, scikit-learn tries to follow the core principles of good object > >> oriented program design, for instance, Abstraction, encapsulation, > >> modularity, hierarchy, ... > >> > >> > What are some good ways and resources to learn Python for data > analysis? > >> > >> I think baed on your questions, a good resource would be an introduction > >> to programming book or course. I think that sections on objected > oriented > >> programming would make the rationale/design/API of scikit-learn and > Python > >> classes as a whole more accessible and address your concerns and > questions. > >> > >> Best, > >> Sebastian > >> > >> > On Jun 18, 2017, at 12:02 PM, C W wrote: > >> > > >> > Dear Scikit-learn, > >> > > >> > What are some good ways and resources to learn Python for data > analysis? > >> > > >> > I am extremely frustrated using this thing. Everything comes after a > >> > dot! Why would you type the sam thing at the beginning of every line. > It's > >> > not efficient. > >> > > >> > code 1: > >> > y_sin = np.sin(x) > >> > y_cos = np.cos(x) > >> > > >> > I know you can import the entire package without the "as np", but I > see > >> > np.something as the standard. Why? > >> > > >> > Code 2: > >> > model = LogisticRegression() > >> > model.fit(X_train, y_train) > >> > model.score(X_test, y_test) > >> > > >> > In R, everything is saved to a variable. In the code above, what if I > >> > accidentally ran model.fit(), I would not know. > >> > > >> > Code 3: > >> > from sklearn import linear_model > >> > reg = linear_model.Ridge (alpha = .5) > >> > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > >> > > >> > In the code above, sklearn > linear_model > Ridge, one lives inside > the > >> > other, it feels that there are multiple layer, how deep do I have to > dig in? > >> > > >> > Can someone explain the mentality behind this setup? > >> > > >> > Thank you very much! 
> >> > > >> > M > >> > _______________________________________________ > >> > scikit-learn mailing list > >> > scikit-learn at python.org > >> > https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From massimodisasha at gmail.com Sun Jun 18 18:42:13 2017 From: massimodisasha at gmail.com (massimo di stefano) Date: Sun, 18 Jun 2017 18:42:13 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: Message-ID: <3B2DEF95-CBA6-468C-B8A5-6B8BC861FBAE@gmail.com> Hi, along with all the great tips you received, perhaps you may find this useful: http://www.cert.org/flocon/2011/matlab-python-xref.pdf I know is not on-topic with your question, but I found it very useful when I start to use python (coming from R) So I thought it was worth to post it here. It is very old but those basic functions are pretty stable. The python code assumes a: from numpy import * which others already explained you why is good practice to avoid it, ?Massimo. > On Jun 18, 2017, at 12:02 PM, C W wrote: > > Dear Scikit-learn, > > What are some good ways and resources to learn Python for data analysis? > > I am extremely frustrated using this thing. Everything comes after a dot! Why would you type the sam thing at the beginning of every line. It's not efficient. 
> > code 1: > y_sin = np.sin(x) > y_cos = np.cos(x) > > I know you can import the entire package without the "as np", but I see np.something as the standard. Why? > > Code 2: > model = LogisticRegression() > model.fit(X_train, y_train) > model.score(X_test, y_test) > > In R, everything is saved to a variable. In the code above, what if I accidentally ran model.fit(), I would not know. > > Code 3: > from sklearn import linear_model > reg = linear_model.Ridge (alpha = .5) > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layer, how deep do I have to dig in? > > Can someone explain the mentality behind this setup? > > Thank you very much! > > M > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at ingofruend.net Sun Jun 18 19:07:37 2017 From: mail at ingofruend.net (mail) Date: Sun, 18 Jun 2017 19:07:37 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: Message-ID: <00f7d068-067f-4f45-b5c4-d53aab63c9ae@localhost> Hi M, I think what you describe can be summarized as the difference of a domain specific language (r) and a general purpose language (Python). Most of what you describe is related to namespaces - "one honking great" feature of python. Namespaces are less needed in r because r is domain specific. But if you write your webserver's frontend, database access, prediction engine, user authentication, and what not all in Python (or at least large part of it), then namespaces help a lot keeping those domains apart. I also added a couple of more specific answers to your points below, but I somehow can't make them appear as "not reply". I hope you all find them. Hope that helps, Ingo > > > > I am extremely frustrated using this thing. Everything comes after a dot! 
Why would you type the sam thing at the beginning of every line. It's not efficient. > > > > > This is mostly the Python way to do namespaces. Although it may not be efficient when you type, it is efficient when you debug: you always get both function/method *and* the context in which it was executed. > > > > code 1: > > y_sin = np.sin(x) > > y_cos = np.cos(x) > > > > I know you can import the entire package without the "as np", but I see np.something as the standard. Why? > > > > Imagine you were doing an analysis for the Catholic church. Obviously sins would play and important role. So there might be a function that's called "sin" somewhere that does something entirely different from a trigonometric function. Ok, maybe this is a bad example but you get the idea. In this case it might even be a real issue because math.sin and numpy.sin do different but similar things. That could be difficult to debug and it's handy to mark which one you are using where. > > > > > > Code 2: > > model = LogisticRegression() > > model.fit(X_train, y_train) > > > model.score(X_test, y_test) > > > > In R, everything is saved to a variable. In the code above, what if I accidentally ran model.fit(), I would not know. > > > > - That's right. You would not know. This is a design decision in sklearn. There are advantages and disadvantages to it. Sklearn is using stateful objects here. For those you would expect to change them by calling their methods. Note though, that the methods you call on your model also return values that are likely what you expect them to return. > > > > Code 3: > from sklearn import linear_model > reg = linear_model.Ridge (alpha = .5) > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > > In the code above, sklearn > linear_model > Ridge, one lives inside the other, it feels that there are multiple layer, how deep do I have to dig in? > > > > - Again, this is the namespace idea. 
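The math.sin / numpy.sin distinction mentioned above can be made concrete. A small editorial sketch, not part of the original thread:

```python
import math
import numpy as np

# math.sin operates on a single Python float...
print(math.sin(math.pi / 2))        # 1.0

# ...while numpy.sin also vectorizes over sequences/arrays:
print(np.sin([0.0, math.pi / 2]))   # [0. 1.]

# Passing a list to math.sin raises a TypeError: the two
# functions are similar but not interchangeable, so the
# math./np. prefixes tell you at a glance which one runs.
try:
    math.sin([0.0, 1.0])
except TypeError:
    print("math.sin does not accept a list")
```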
Python allows you to group functions, classes, and even namespaces themselves in namespaces. For larger packages, this can be very useful because you can structure your code accordingly. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Jun 19 02:12:22 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 19 Jun 2017 08:12:22 +0200 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: <52C09D63-DAB0-44B2-B58F-71192B9EC956@gmail.com> Message-ID: <20170619061222.GA1243059@phare.normalesup.org> Another reference that I like a lot for people who already know a programming language and are trying to learn Python is "Python Essential Reference" by David Beazley. It gives a good understanding of how Python works, though it does not talk about numerical computing libraries. Ga?l From gael.varoquaux at normalesup.org Mon Jun 19 02:13:13 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 19 Jun 2017 08:13:13 +0200 Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R In-Reply-To: References: <043092d0-a43e-69a1-364b-fa194242facf@gmail.com> Message-ID: <20170619061313.GB1243059@phare.normalesup.org> > I misspoke. I didn't mean that there is a reason not to support it, > just that there are no current plans to support it and that we would > welcome a willing contributor to get it rolling.? I thought that there was a PR looking at it? 
Ga?l From joel.nothman at gmail.com Mon Jun 19 03:29:34 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 19 Jun 2017 17:29:34 +1000 Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R In-Reply-To: <20170619061313.GB1243059@phare.normalesup.org> References: <043092d0-a43e-69a1-364b-fa194242facf@gmail.com> <20170619061313.GB1243059@phare.normalesup.org> Message-ID: There's a PR about handling missing values in RF, and a PR about imputing with more sophistication than a single, global feature-wise statistic, but nothing about RF imputation. On 19 June 2017 at 16:13, Gael Varoquaux wrote: > > I misspoke. I didn't mean that there is a reason not to support it, > > just that there are no current plans to support it and that we would > > welcome a willing contributor to get it rolling. > > I thought that there was a PR looking at it? > > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Jun 19 05:47:01 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 19 Jun 2017 11:47:01 +0200 Subject: [scikit-learn] Need Help Random Forest Imputation Model as in R In-Reply-To: References: <043092d0-a43e-69a1-364b-fa194242facf@gmail.com> <20170619061313.GB1243059@phare.normalesup.org> Message-ID: <20170619094701.GD4005469@phare.normalesup.org> Point taken. G On Mon, Jun 19, 2017 at 05:29:34PM +1000, Joel Nothman wrote: > There's a PR about handling missing values in RF, and a PR about imputing with > more sophistication than a single, global feature-wise statistic, but nothing > about RF imputation. > On 19 June 2017 at 16:13, Gael Varoquaux wrote: > > I misspoke. 
I didn't mean that there is a reason not to support it, > > just that there are no current plans to support it and that we would > > welcome a willing contributor to get it rolling.? > I thought that there was a PR looking at it? > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From gael.pegliasco at makina-corpus.com Tue Jun 20 00:28:36 2017 From: gael.pegliasco at makina-corpus.com (=?UTF-8?Q?Ga=c3=abl_Pegliasco?=) Date: Tue, 20 Jun 2017 06:28:36 +0200 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: Message-ID: Hi, You may find these R/Python comparison-sheets useful in understanding both languages syntaxes and concepts: * https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis * http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html Ga?l, Le 18/06/2017 ? 18:02, C W a ?crit : > Dear Scikit-learn, > > What are some good ways and resources to learn Python for data analysis? > > I am extremely frustrated using this thing. Everything comes after a > dot! Why would you type the sam thing at the beginning of every line. > It's not efficient. > > code 1: > y_sin = np.sin(x) > y_cos = np.cos(x) > > I know you can import the entire package without the "as np", but I > see np.something as the standard. Why? > > Code 2: > model = LogisticRegression() > model.fit(X_train, y_train) > model.score(X_test, y_test) > > In R, everything is saved to a variable. In the code above, what if I > accidentally ran model.fit(), I would not know. 
> > Code 3: > from sklearn import linear_model > reg = linear_model.Ridge (alpha = .5) > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > In the code above, sklearn > linear_model > Ridge, one lives inside > the other, it feels that there are multiple layer, how deep do I have > to dig in? > > Can someone explain the mentality behind this setup? > > Thank you very much! > > M > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Makina Corpus Newsletters | Formations | Twitter Ga?l Pegliasco Chef de projets T?l : 02 51 79 80 84 Portable : 06 41 69 16 09 11 rue du Marchix FR-44000 Nantes -- @GPegliasco -- D?couvrez Talend Data Integration , LA solution d'int?gration de donn?es Open Source -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bckoegajjgeobgik.png Type: image/png Size: 6215 bytes Desc: not available URL: From gael.pegliasco at makina-corpus.com Tue Jun 20 00:32:58 2017 From: gael.pegliasco at makina-corpus.com (=?UTF-8?Q?Ga=c3=abl_Pegliasco?=) Date: Tue, 20 Jun 2017 06:32:58 +0200 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: Message-ID: <20a6504a-99c5-5b7b-71d6-4f14c74d2efe@makina-corpus.com> And, answering your last question, a good way to learn Data science using Python is, for I, "Python data science handbook" that you can read as Jupyter notebooks: https://github.com/jakevdp/PythonDataScienceHandbook Le 20/06/2017 ? 06:28, Ga?l Pegliasco via scikit-learn a ?crit : > Hi, > > You may find these R/Python comparison-sheets useful in understanding > both languages syntaxes and concepts: > > * https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis > * http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html > > > Ga?l, > > Le 18/06/2017 ? 
18:02, C W a ?crit : >> Dear Scikit-learn, >> >> What are some good ways and resources to learn Python for data analysis? >> >> I am extremely frustrated using this thing. Everything comes after a >> dot! Why would you type the sam thing at the beginning of every line. >> It's not efficient. >> >> code 1: >> y_sin = np.sin(x) >> y_cos = np.cos(x) >> >> I know you can import the entire package without the "as np", but I >> see np.something as the standard. Why? >> >> Code 2: >> model = LogisticRegression() >> model.fit(X_train, y_train) >> model.score(X_test, y_test) >> >> In R, everything is saved to a variable. In the code above, what if I >> accidentally ran model.fit(), I would not know. >> >> Code 3: >> from sklearn import linear_model >> reg = linear_model.Ridge (alpha = .5) >> reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) >> >> In the code above, sklearn > linear_model > Ridge, one lives inside >> the other, it feels that there are multiple layer, how deep do I have >> to dig in? >> >> Can someone explain the mentality behind this setup? >> >> Thank you very much! 
>> >> M >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Makina Corpus > Newsletters | > Formations | Twitter > > > Ga?l Pegliasco > Chef de projets > T?l : 02 51 79 80 84 > Portable : 06 41 69 16 09 > 11 rue du Marchix FR-44000 Nantes > -- > @GPegliasco > -- > D?couvrez Talend Data Integration > , LA > solution d'int?gration de donn?es Open Source > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Makina Corpus Newsletters | Formations | Twitter Ga?l Pegliasco Chef de projets T?l : 02 51 79 80 84 Portable : 06 41 69 16 09 11 rue du Marchix FR-44000 Nantes -- @GPegliasco -- D?couvrez Talend Data Integration , LA solution d'int?gration de donn?es Open Source -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bckoegajjgeobgik.png Type: image/png Size: 6215 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: demfflofhelojfjn.png Type: image/png Size: 6215 bytes Desc: not available URL: From tmrsg11 at gmail.com Tue Jun 20 01:20:41 2017 From: tmrsg11 at gmail.com (C W) Date: Tue, 20 Jun 2017 01:20:41 -0400 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: <20a6504a-99c5-5b7b-71d6-4f14c74d2efe@makina-corpus.com> References: <20a6504a-99c5-5b7b-71d6-4f14c74d2efe@makina-corpus.com> Message-ID: I am catching up to all the replies, apologies for the delay. (replied in reverse order) @ Ga?l, Thanks for your comments. I actually started with 1) Data Camp courses and 2) Python for Data Science book. Here's my review: 1) The course: it is fantastic! But they only give you a flavor of A FEW things. 
2) The book: it is a quick crash course, but not enough for you to take off. See code below. # Toy Python Code import numpy as np import pandas as pd N = 100 df = pd.DataFrame({ 'A': pd.date_range(start='2016-01-01',periods=N,freq='D'), 'x': np.linspace(0,stop=N-1,num=N), 'y': np.random.rand(N), 'C': np.random.choice(['Low','Medium','High'],N).tolist(), 'D': np.random.normal(100, 10, size=(N)).tolist() }) df.x len(dir(df)) # end of Python code My confusion: a) df.x gives you column x, but why? I thought things after the dot are actions, or more like verbs performed on the object, namely df, in this case. b) len(dir(df)) gives 431. I only created a dataframe, where did all these 431 things come from? Is there documentation about this? It scares me because I only asked for a dataframe. @ Gael This is a pretty solid reference. It explains methods among other things, which is awesome! I think methods are the barrier to entry for R users. @ Mail Thanks for the details, I will try to pick these computer science terminologies up. It has been a brutal week. @Massimo Yes, I have used that. It is indeed great as a one-to-one equivalence reference. Thanks! On Tue, Jun 20, 2017 at 12:32 AM, Gaël Pegliasco via scikit-learn < scikit-learn at python.org> wrote: > And, answering your last question, a good way to learn data science using > Python is, for me, "Python Data Science Handbook", which you can read as > Jupyter notebooks: > > https://github.com/jakevdp/PythonDataScienceHandbook > > > Le 20/06/2017 à 06:28, Gaël Pegliasco via scikit-learn a écrit : > > Hi, > > You may find these R/Python comparison sheets useful in understanding both > languages' syntaxes and concepts: > > > - https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis > - http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html > > > Gaël, > > Le 18/06/2017 à 18:02, C W a écrit : > > Dear Scikit-learn, > > What are some good ways and resources to learn Python for data analysis?
> > I am extremely frustrated using this thing. Everything comes after a dot! > Why would you type the sam thing at the beginning of every line. It's not > efficient. > > code 1: > y_sin = np.sin(x) > y_cos = np.cos(x) > > I know you can import the entire package without the "as np", but I see > np.something as the standard. Why? > > Code 2: > model = LogisticRegression() > model.fit(X_train, y_train) > model.score(X_test, y_test) > > In R, everything is saved to a variable. In the code above, what if I > accidentally ran model.fit(), I would not know. > > Code 3: > from sklearn import linear_model > reg = linear_model.Ridge (alpha = .5) > reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) > > In the code above, sklearn > linear_model > Ridge, one lives inside the > other, it feels that there are multiple layer, how deep do I have to dig in? > > Can someone explain the mentality behind this setup? > > Thank you very much! > > M > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > -- > [image: Makina Corpus] > Newsletters | > Formations | Twitter > > Ga?l Pegliasco > Chef de projets > T?l : 02 51 79 80 84 > Portable : 06 41 69 16 09 > 11 rue du Marchix FR-44000 Nantes > -- > @GPegliasco > -- > D?couvrez Talend Data Integration > , LA solution > d'int?gration de donn?es Open Source > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > -- > [image: Makina Corpus] > Newsletters | > Formations | Twitter > > Ga?l Pegliasco > Chef de projets > T?l : 02 51 79 80 84 > Portable : 06 41 69 16 09 > 11 rue du Marchix FR-44000 Nantes > -- > @GPegliasco > -- > D?couvrez Talend Data Integration > , LA solution > d'int?gration de donn?es Open Source > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > 
https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: demfflofhelojfjn.png Type: image/png Size: 6215 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bckoegajjgeobgik.png Type: image/png Size: 6215 bytes Desc: not available URL: From albertthomas88 at gmail.com Tue Jun 20 04:35:23 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Tue, 20 Jun 2017 08:35:23 +0000 Subject: [scikit-learn] R user trying to learn Python In-Reply-To: References: <20a6504a-99c5-5b7b-71d6-4f14c74d2efe@makina-corpus.com> Message-ID: You can also have a look at "Effective Computation in Physics" by Anthony Scopatz and Kathryn D. Huff. It gives a very good overview of Python/numpy/pandas... Albert Thomas On Tue, 20 Jun 2017 at 07:25, C W wrote: > I am catching up to all the replies, apologies for the delay. (replied in > reverse order) > > @ Ga?l, > Thanks for your comments. I actually started with 1) Data Camp courses and > 2) Python for Data Science book. > > Here's my review: > 1) The course: it is fantastic! But they only give you a flavor of A FEW > things. > 2) The book: it is quick crash course, but not enough for you to take off. > See code below. > > # Toy Python Code > import numpy as np > import pandas as pd > > N = 100 > df = pd.DataFrame({ > 'A': pd.date_range(start='2016-01-01',periods=N,freq='D'), > 'x': np.linspace(0,stop=N-1,num=N), > 'y': np.random.rand(N), > 'C': np.random.choice(['Low','Medium','High'],N).tolist(), > 'D': np.random.normal(100, 10, size=(N)).tolist() > }) > df.x > len(dir(df)) > # end of Python code > > My confusion: > a) df.x gives you column x, but why, I thought things after dot are > actions, or more like verbs performed on the object, namely df, in this > case. > b) len(dir(df)) gives 431. 
I only crated a dataframe, where did all these > 431 things come from? Is there a documentation about this? It scares me > because I only asked for a dataframe. > > @ Gael > This is a pretty solid reference. It explains methods among other things, > which is awesome! I think method is the barrier to entry for R users. > > @ Mail > Thanks for the details, I will try to pick these computer science > terminologies up. It has been a brutal week. > > @Massimo > Yes, I have used that. It is indeed great for one to one equivalence > reference. > > Thanks! > > > > > > On Tue, Jun 20, 2017 at 12:32 AM, Ga?l Pegliasco via scikit-learn < > scikit-learn at python.org> wrote: > >> And, answering your last question, a good way to learn Data science using >> Python is, for I, "Python data science handbook" that you can read as >> Jupyter notebooks: >> >> https://github.com/jakevdp/PythonDataScienceHandbook >> >> >> Le 20/06/2017 ? 06:28, Ga?l Pegliasco via scikit-learn a ?crit : >> >> Hi, >> >> You may find these R/Python comparison-sheets useful in understanding >> both languages syntaxes and concepts: >> >> >> - >> https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis >> - http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html >> >> >> Ga?l, >> >> Le 18/06/2017 ? 18:02, C W a ?crit : >> >> Dear Scikit-learn, >> >> What are some good ways and resources to learn Python for data analysis? >> >> I am extremely frustrated using this thing. Everything comes after a dot! >> Why would you type the sam thing at the beginning of every line. It's not >> efficient. >> >> code 1: >> y_sin = np.sin(x) >> y_cos = np.cos(x) >> >> I know you can import the entire package without the "as np", but I see >> np.something as the standard. Why? >> >> Code 2: >> model = LogisticRegression() >> model.fit(X_train, y_train) >> model.score(X_test, y_test) >> >> In R, everything is saved to a variable. 
In the code above, what if I >> accidentally ran model.fit(), I would not know. >> >> Code 3: >> from sklearn import linear_model >> reg = linear_model.Ridge (alpha = .5) >> reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) >> >> In the code above, sklearn > linear_model > Ridge, one lives inside the >> other, it feels that there are multiple layer, how deep do I have to dig in? >> >> Can someone explain the mentality behind this setup? >> >> Thank you very much! >> >> M >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> [image: bckoegajjgeobgik.png] >> Newsletters | >> Formations | Twitter >> >> Ga?l Pegliasco >> Chef de projets >> T?l : 02 51 79 80 84 >> Portable : 06 41 69 16 09 >> 11 rue du Marchix FR-44000 Nantes >> -- >> @GPegliasco >> -- >> D?couvrez Talend Data Integration >> , LA solution >> d'int?gration de donn?es Open Source >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> [image: demfflofhelojfjn.png] >> Newsletters | >> Formations | Twitter >> >> Ga?l Pegliasco >> Chef de projets >> T?l : 02 51 79 80 84 >> Portable : 06 41 69 16 09 >> 11 rue du Marchix FR-44000 Nantes >> -- >> @GPegliasco >> -- >> D?couvrez Talend Data Integration >> , LA solution >> d'int?gration de donn?es Open Source >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: demfflofhelojfjn.png Type: image/png Size: 6215 bytes Desc: not available URL: From t3kcit at gmail.com Wed Jun 21 00:01:06 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 21 Jun 2017 00:01:06 -0400 Subject: [scikit-learn] ANN scikit-learn 0.18.2 numpy 1.12 compatibility release Message-ID: Hey everybody. I just pushed the minor release 0.18.2 to PyPI. This release contains minor fixes to ensure compatibility with numpy 1.12, mostly for the examples. There is a small fix in the Gaussian process module, too. Check the changelog here: http://scikit-learn.org/stable/whats_new.html#version-0-18-2 Stay tuned for the 0.19 release candidate, which we're putting together right now! From t3kcit at gmail.com Wed Jun 21 00:10:30 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 21 Jun 2017 00:10:30 -0400 Subject: [scikit-learn] ANN scikit-learn 0.18.2 numpy 1.12 compatibility release In-Reply-To: References: Message-ID: <595d647f-ae4b-2526-c980-825096698522@gmail.com> Maybe one of these days I'll get a release announcement right. Fixes for numpy 1.13, of course. On 06/21/2017 12:01 AM, Andreas Mueller wrote: > Hey everybody. > > I just pushed the minor release 0.18.2 to PyPI. > This release contains minor fixes to ensure compatibility with numpy > 1.12, > mostly for the examples. > There is a small fix in the Gaussian process module, too. > > Check the changelog here: > http://scikit-learn.org/stable/whats_new.html#version-0-18-2 > > Stay tuned for the 0.19 release candidate, which we're putting together > right now!
> > Cheers, > Andy From gael.varoquaux at normalesup.org Thu Jun 22 08:58:27 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 22 Jun 2017 14:58:27 +0200 Subject: [scikit-learn] ANN scikit-learn 0.18.2 numpy 1.12 compatibility release In-Reply-To: References: Message-ID: <20170622125827.GC246543@phare.normalesup.org> Thank you so much Andy for this. It provides a lot of value to our users, and it's actually tough work. Gaël On Wed, Jun 21, 2017 at 12:01:06AM -0400, Andreas Mueller wrote: > Hey everybody. > I just pushed the minor release 0.18.2 to pypi. > This release contains minor fixes to ensure compatibility with numpy 1.12, > mostly for the examples. > There is a small fix in the gaussian process module, too. > Check the changelog here: > http://scikit-learn.org/stable/whats_new.html#version-0-18-2 > Stay tuned for the 0.19 release candidate, that we're putting together right > now! > Cheers, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From joel.nothman at gmail.com Thu Jun 22 09:30:06 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 22 Jun 2017 23:30:06 +1000 Subject: [scikit-learn] ANN scikit-learn 0.18.2 numpy 1.12 compatibility release In-Reply-To: <20170622125827.GC246543@phare.normalesup.org> References: <20170622125827.GC246543@phare.normalesup.org> Message-ID: +1 On 22 June 2017 at 22:58, Gael Varoquaux wrote: > Thank you so much Andy for this. It provides a lot of value to our users, > and it's actually tough work. > > Gaël > > On Wed, Jun 21, 2017 at 12:01:06AM -0400, Andreas Mueller wrote: > > Hey everybody. > > > I just pushed the minor release 0.18.2 to pypi.
> > This release contains minor fixes to ensure compatibility with numpy > 1.12, > > mostly for the examples. > > There is a small fix in the gaussian process module, too. > > > Check the changelog here: > > http://scikit-learn.org/stable/whats_new.html#version-0-18-2 > > > Stay tuned for the 0.19 release candidate, that we're putting together > right > > now! > > > Cheers, > > Andy > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Thu Jun 22 17:33:52 2017 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Thu, 22 Jun 2017 23:33:52 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: Dear all, I posted the full question on StackOverflow and as it contains some figures I refer you to that post. https://stackoverflow.com/questions/44661926/sample-weight-parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285 I currently believe that this issue is a bug and I opened an issue on GitHub. To sum up, the issue is that GridSearchCV does not handle the splitting of the sample_weight vector during cross validation. Nota bene: cross_val_score seems to handle the splitting OK, this issue seems to occur only in GridSearchCV. Any comments enlightening me and showing me how wrong I am are most welcome.
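[Editorial aside: the usage under discussion in this thread can be sketched in a minimal, self-contained example. Everything here is hypothetical — `DecisionTreeRegressor` and the random weights stand in for the poster's `my_Regressor` and weighting scheme — and it shows a single sample_weight vector passed as a *fit parameter*, which the search slices along with x and y for each CV split (in the 0.18-era API this could also be given through the `GridSearchCV(..., fit_params={...})` constructor argument):]

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = rng.uniform(size=(1000, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.1, size=1000)
my_sample_weights = rng.uniform(size=1000)  # hypothetical per-sample weights

# sample_weight is a fit parameter, not a grid parameter: the search
# indexes it with the same train indices as x and y on every split.
validator = GridSearchCV(DecisionTreeRegressor(random_state=0),
                         param_grid={'max_depth': [2, 4]},
                         cv=3)
validator.fit(x, y, sample_weight=my_sample_weights)
print(validator.best_params_)
```

This is a sketch of the working pattern, not the poster's actual model; the shape error in the thread arises when the weights are put in `param_grid` instead.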
-------------- next part -------------- An HTML attachment was scrubbed... URL: From julio at esbet.es Thu Jun 22 17:47:53 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Thu, 22 Jun 2017 23:47:53 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: Hi Manuel, Are you sure that you are using the latest version (or at least >0.17)? The code for splitting the sample weights in GridSearchCV has been there for a while now... -- Julio > El 22 jun 2017, a las 23:33, Manuel Castejón Limas escribió: > > Dear all, > I posted the full question on StackOverflow and as it contains some figures I refer you to that post. > > https://stackoverflow.com/questions/44661926/sample-weight-parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285 > > I currently believe that this issue is a bug and I opened an issue on GitHub. > > To sum up, the issue is that GridSearchCV does not handle the splitting of the sample_weight vector during cross validation. > > Nota bene: cross_val_score seems to handle the splitting OK, this issue seems to occur only in GridSearchCV. > > Any comments enlightening me and showing me how wrong I am are most welcome. > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 22 19:02:39 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 23 Jun 2017 09:02:39 +1000 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: why are you passing [my_sample_weights] rather than just my_sample_weights?
On 23 Jun 2017 7:49 am, "Julio Antonio Soto de Vicente" wrote: > Hi Manuel, > > Are you sure that you are using the latest version (or at least >0.17)? > The code for splitting the sample weights in GridSearchCV has been there > for a while now... > > -- > Julio > > El 22 jun 2017, a las 23:33, Manuel Castejón Limas < > manuel.castejon at gmail.com> escribió: > > Dear all, > I posted the full question on StackOverflow and as it contains some > figures I refer you to that post. > > https://stackoverflow.com/questions/44661926/sample-weight- > parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285 > > I currently believe that this issue is a bug and I opened an issue on > GitHub. > > To sum up, the issue is that GridSearchCV does not handle the splitting of > the sample_weight vector during cross validation. > > Nota bene: cross_val_score seems to handle the splitting OK, this issue > seems to occur only in GridSearchCV. > > Any comments enlightening me and showing me how wrong I am are most > welcome. > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcasl at unileon.es Thu Jun 22 19:13:45 2017 From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=) Date: Fri, 23 Jun 2017 01:13:45 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: Hello Antonio, Sure: import sklearn print(sklearn.__version__) 0.18.1 The error suggests that the fit function is expecting a split vector with size 2/3*1000 but the whole vector (size 1000) is passed. ...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast. El 22 jun. 2017 11:49 p. m., "Julio Antonio Soto de Vicente" escribió: Hi Manuel, Are you sure that you are using the latest version (or at least >0.17)? The code for splitting the sample weights in GridSearchCV has been there for a while now... -- Julio El 22 jun 2017, a las 23:33, Manuel Castejón Limas < manuel.castejon at gmail.com> escribió: Dear all, I posted the full question on StackOverflow and as it contains some figures I refer you to that post. https://stackoverflow.com/questions/44661926/sample-weight-parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285 I currently believe that this issue is a bug and I opened an issue on GitHub. To sum up, the issue is that GridSearchCV does not handle the splitting of the sample_weight vector during cross validation. Nota bene: cross_val_score seems to handle the splitting OK, this issue seems to occur only in GridSearchCV. Any comments enlightening me and showing me how wrong I am are most welcome. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcasl at unileon.es Thu Jun 22 19:17:29 2017 From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=) Date: Fri, 23 Jun 2017 01:17:29 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: Dear Joel, I'm just passing an iterable as I would do with any other sequence of parameters to tune. In this case the list only has one element to use but in general I ought to be able to pass a collection of vectors.
Anyway, I guess that that issue is not the cause of the problem. El 23 jun. 2017 1:04 a. m., "Joel Nothman" escribió: > why are you passing [my_sample_weights] rather than just my_sample_weights? > > On 23 Jun 2017 7:49 am, "Julio Antonio Soto de Vicente" > wrote: > >> Hi Manuel, >> >> Are you sure that you are using the latest version (or at least >0.17)? >> The code for splitting the sample weights in GridSearchCV has been there >> for a while now... >> >> -- >> Julio >> >> El 22 jun 2017, a las 23:33, Manuel Castejón Limas < >> manuel.castejon at gmail.com> escribió: >> >> Dear all, >> I posted the full question on StackOverflow and as it contains some >> figures I refer you to that post. >> >> https://stackoverflow.com/questions/44661926/sample-weight-p >> arameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285 >> >> I currently believe that this issue is a bug and I opened an issue on >> GitHub. >> >> To sum up, the issue is that GridSearchCV does not handle the splitting >> of the sample_weight vector during cross validation. >> >> Nota bene: cross_val_score seems to handle the splitting OK, this issue >> seems to occur only in GridSearchCV. >> >> Any comments enlightening me and showing me how wrong I am are most >> welcome. >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From manuel.castejon at gmail.com Fri Jun 23 04:34:41 2017 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Fri, 23 Jun 2017 10:34:41 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: Dear Joel, I tried and removed the square brackets and now it works as expected *for a single* sample_weight vector: validator = GridSearchCV(my_Regressor, param_grid={'number_of_hidden_neurons': range(4, 5), 'epochs': [50], }, fit_params={'sample_weight': my_sample_weights }, n_jobs=1, ) validator.fit(x, y) The problem now is that I want to try multiple trainings with multiple sample_weight parameters, in the following fashion: validator = GridSearchCV(my_Regressor, param_grid={'number_of_hidden_neurons': range(4, 5), 'epochs': [50], 'sample_weight': [my_sample_weights, my_sample_weights**2] , }, fit_params={}, n_jobs=1, ) validator.fit(x, y) But unfortunately it produces the same error again: ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast. I guess that the issue is that the sample_weight parameter was not thought to be changed during the tuning, was it? Thank you all for your patience and support. Best Manolo 2017-06-23 1:17 GMT+02:00 Manuel CASTEJÓN LIMAS : > Dear Joel, > I'm just passing an iterable as I would do with any other sequence of > parameters to tune. In this case the list only has one element to use but > in general I ought to be able to pass a collection of vectors. > Anyway, I guess that that issue is not the cause of the problem. > > El 23 jun. 2017 1:04 a. m., "Joel Nothman" > escribió: > >> why are you passing [my_sample_weights] rather than just >> my_sample_weights? >> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ian at ianozsvald.com Fri Jun 23 13:12:24 2017 From: ian at ianozsvald.com (Ian Ozsvald) Date: Fri, 23 Jun 2017 18:12:24 +0100 Subject: [scikit-learn] Query about use of standard deviation on tree feature_importances_ in demo plot_forest_importances.html Message-ID: Hi all. I'm looking at the code behind one of the tree ensemble demos: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html and I'm unsure about the error bars. They are calculated using the standard deviation of the feature_importances_ attribute across trees. Can we depend on this being a Normal distribution? I'm wondering if the plot tells enough of the story to be genuinely useful? I don't have a strong belief in the likely distribution of feature_importances_, I haven't dug into how the feature importances are calculated (frankly I'm a bit lost here). I know that on a RF Regression case I'm working on I can see unimodal and bimodal feature importance distributions - this came up on a discussion on the yellowbrick sklearn visualisation package: https://github.com/DistrictDataLabs/yellowbrick/pull/195 I don't know what is "normal" for feature importances and if they look different between classification tasks (as in the plot_forest_importances demo) and regression tasks. Maybe I've got an outlier in my task? If I use the provided demo code then my error bars can go negative, so that feels unhelpful. Does anyone have an opinion? Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic value? I saw Sebastian Raschka's reference to Gilles Louppe et al.'s NIPS paper (in here, 2016-05-17) on variable importances, I'll dig into that if nobody has a strong opinion. BTW Sebastian - thanks for writing your book. Cheers, Ian. 
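[Editorial aside: Ian's question can be made concrete with a small, hypothetical sketch. The data and forest below are synthetic stand-ins for the demo's setup; the point is that per-tree `feature_importances_` form a matrix whose spread can be summarized either by the demo's standard deviation or by percentiles, and only the former can produce bars extending below zero:]

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One row of importances per tree, one column per feature.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_imp = per_tree.mean(axis=0)   # matches forest.feature_importances_ up to normalization
std_imp = per_tree.std(axis=0)     # what the demo plots as error bars
p25, p75 = np.percentile(per_tree, [25, 75], axis=0)  # quartile alternative

# Importances are non-negative, so quartile bars can never go below zero,
# while mean - std can.
print("any std bar below zero:", bool((mean_imp - std_imp < 0).any()))
print("any quartile below zero:", bool((p25 < 0).any()))
```

A box plot over the columns of `per_tree` (one box per feature) is another way to expose multimodal or zero-heavy importance distributions that a single std bar hides.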
-- Ian Ozsvald (Data Scientist, PyDataLondon co-chair) ian at IanOzsvald.com http://IanOzsvald.com http://ModelInsight.io http://twitter.com/IanOzsvald From olivier.grisel at ensta.org Fri Jun 23 13:51:09 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 23 Jun 2017 19:51:09 +0200 Subject: [scikit-learn] Query about use of standard deviation on tree feature_importances_ in demo plot_forest_importances.html In-Reply-To: References: Message-ID: +1 for changing this example to have error bars represent 5 & 95 percentiles or 25 and 75 percentiles (quartiles). Or even bootstrapped confidence intervals of the mean feature importance for each variable. This might be a bit too verbose for an example though. > Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic value? Yes. Otherwise people might be over-confident in the stability of those feature importances. -- Olivier From olivier.grisel at ensta.org Fri Jun 23 17:15:20 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 23 Jun 2017 23:15:20 +0200 Subject: [scikit-learn] Scikit-learn workshop and sprint at EuroScipy 2017 in Erlangen Message-ID: Hi all, FYI I have just submitted a 90 min tutorial on scikit-learn to the EuroScipy CFP. If anybody is interested in co-teaching / TA-ing this workshop please let me know. I also plan to stay for the one-day sprint to help people make their first contribution to the project. Last year we had great fun and the sprint was very productive. Registration is now open: https://www.euroscipy.org/2017/ 10th European Conference on Python in Science Location: Erlangen August 28-29 (Mon, Tue) Tutorials / Workshops August 30 - 31 (Wed, Thu) Main conference and posters September 1 (Fri) Sprints See you in Erlangen!
-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel From ian at ianozsvald.com Sat Jun 24 05:17:51 2017 From: ian at ianozsvald.com (Ian Ozsvald) Date: Sat, 24 Jun 2017 10:17:51 +0100 Subject: [scikit-learn] Query about use of standard deviation on tree feature_importances_ in demo plot_forest_importances.html In-Reply-To: References: Message-ID: Good. I'd suggested a box plot or use of IQR (on a bar chart) on the yellowbrick list. I was assuming that a distribution of feature importances containing many '0's might indeed be worth highlighting as a diagnostic. Cheers, Ian. On 23 June 2017 at 18:51, Olivier Grisel wrote: > +1 for changing this example to have error bars represent 5 & 95 > percentiles or 25 and 75 percentiles (quartiles). > > Or even bootstrapped confidence intervals of the mean feature > importance for each variable. This might be a bit too verbose for an > example though. > >> Perhaps more importantly - is a visual > indication of the spread of feature importances in an ensemble > actually a useful thing to plot? Does it serve a diagnostic value? > > Yes. Otherwise people might be over-confident in the stability of > those feature importances. > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Ian Ozsvald (Data Scientist, PyDataLondon co-chair) ian at IanOzsvald.com http://IanOzsvald.com http://ModelInsight.io http://twitter.com/IanOzsvald From joel.nothman at gmail.com Sat Jun 24 09:51:14 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 24 Jun 2017 23:51:14 +1000 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: yes, trying multiple sample weightings is not supported by grid search directly.
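[Editorial aside: since multiple sample weightings cannot be grid-searched directly, the usual workaround is an outer loop with one search per candidate weighting. The sketch below is hypothetical — synthetic data, a stand-in `DecisionTreeRegressor`, and made-up weights named after the thread's `my_sample_weights`:]

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = rng.uniform(size=(300, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.1, size=300)
my_sample_weights = rng.uniform(size=300)  # hypothetical weighting scheme

# The weights are not a grid parameter, so each candidate weighting gets
# its own search; best scores are compared afterwards.
results = {}
for name, w in [('linear', my_sample_weights),
                ('squared', my_sample_weights ** 2)]:
    search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                          param_grid={'max_depth': [2, 4]}, cv=3)
    search.fit(x, y, sample_weight=w)
    results[name] = search.best_score_
print(results)
```

Note that the comparison is only indicative: each weighting changes the fitting objective, so the cross-validated scores of differently weighted fits are not strictly on the same scale.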
On 23 Jun 2017 6:36 pm, "Manuel Castejón Limas" wrote: > Dear Joel, > > I tried and removed the square brackets and now it works as expected *for > a single* sample_weight vector: > > validator = GridSearchCV(my_Regressor, > param_grid={'number_of_hidden_neurons': range(4, 5), > 'epochs': [50], > }, > fit_params={'sample_weight': my_sample_weights }, > n_jobs=1, > ) > validator.fit(x, y) > > The problem now is that I want to try multiple trainings with multiple > sample_weight parameters, in the following fashion: > > validator = GridSearchCV(my_Regressor, > param_grid={'number_of_hidden_neurons': range(4, 5), > 'epochs': [50], > 'sample_weight': [my_sample_weights, my_sample_weights**2] , > }, > fit_params={}, > n_jobs=1, > ) > validator.fit(x, y) > > But unfortunately it produces the same error again: > > ValueError: Found a sample_weight array with shape (1000,) for an input > with shape (666, 1). sample_weight cannot be broadcast. > > I guess that the issue is that the sample_weight parameter was not > thought to be changed during the tuning, was it? > > > Thank you all for your patience and support. > Best > Manolo > > > > > 2017-06-23 1:17 GMT+02:00 Manuel CASTEJÓN LIMAS : > >> Dear Joel, >> I'm just passing an iterable as I would do with any other sequence of >> parameters to tune. In this case the list only has one element to use but >> in general I ought to be able to pass a collection of vectors. >> Anyway, I guess that that issue is not the cause of the problem. >> >> El 23 jun. 2017 1:04 a. m., "Joel Nothman" >> escribió: >> >>> why are you passing [my_sample_weights] rather than just >>> my_sample_weights? >>> >>> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From julio at esbet.es Sat Jun 24 14:03:57 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Sat, 24 Jun 2017 20:03:57 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: Message-ID: <211ACCD2-E744-48FA-AFDA-D0C1AC4BA5E8@esbet.es> Joel is right. In fact, you usually don't want to tune the sample weights much: you may leave them default, set them in order to balance classes, or fix them according to some business rule. That said, you can always run a couple of grid searches changing the sample weights and compare results afterwards. -- Julio > El 24 jun 2017, a las 15:51, Joel Nothman escribió: > > yes, trying multiple sample weightings is not supported by grid search directly. > >> On 23 Jun 2017 6:36 pm, "Manuel Castejón Limas" wrote: >> Dear Joel, >> >> I tried and removed the square brackets and now it works as expected for a single sample_weight vector: >> >> validator = GridSearchCV(my_Regressor, >> param_grid={'number_of_hidden_neurons': range(4, 5), >> 'epochs': [50], >> }, >> fit_params={'sample_weight': my_sample_weights }, >> n_jobs=1, >> ) >> validator.fit(x, y) >> The problem now is that I want to try multiple trainings with multiple sample_weight parameters, in the following fashion: >> >> validator = GridSearchCV(my_Regressor, >> param_grid={'number_of_hidden_neurons': range(4, 5), >> 'epochs': [50], >> 'sample_weight': [my_sample_weights, my_sample_weights**2] , >> }, >> fit_params={}, >> n_jobs=1, >> ) >> validator.fit(x, y) >> But unfortunately it produces the same error again: >> >> ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast. >> >> I guess that the issue is that the sample_weight parameter was not thought to be changed during the tuning, was it? >> >> >> Thank you all for your patience and support.
>> Best >> Manolo >> >> >> >> >> 2017-06-23 1:17 GMT+02:00 Manuel CASTEJÓN LIMAS : >>> Dear Joel, >>> I'm just passing an iterable as I would do with any other sequence of parameters to tune. In this case the list only has one element to use but in general I ought to be able to pass a collection of vectors. >>> Anyway, I guess that that issue is not the cause of the problem. >>> >>> El 23 jun. 2017 1:04 a. m., "Joel Nothman" escribió: >>>> why are you passing [my_sample_weights] rather than just my_sample_weights? >>>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From betatim at gmail.com Sun Jun 25 03:09:01 2017 From: betatim at gmail.com (Tim Head) Date: Sun, 25 Jun 2017 07:09:01 +0000 Subject: [scikit-learn] Scikit-learn workshop and sprint at EuroScipy 2017 in Erlangen In-Reply-To: References: Message-ID: Hi Olivier, On Fri, Jun 23, 2017 at 11:16 PM Olivier Grisel wrote: > Hi all, > > FYI I have just submitted a 90 min tutorial on scikit-learn to the > EuroScipy CFP. If anybody is interested in co-teaching / TA-ing this > workshop please let me know. > I will be at EuroScipy and interested in co-teaching. T -------------- next part -------------- An HTML attachment was scrubbed... URL: From parmsingh129 at gmail.com Sun Jun 25 07:48:09 2017 From: parmsingh129 at gmail.com (Parminder Singh) Date: Sun, 25 Jun 2017 17:18:09 +0530 Subject: [scikit-learn] [Feature] drop_one in one hot encoder Message-ID: Hy Sci-kittens!
:-) I was doing machine learning a-z course on Udemy, there they told that every time one-hot encoding is done, one of the columns should be dropped as it is like doubling same category twice and redundant to model. I thought if instead of having user find the index and drop it after preprocessing, OneHotEncoder had a drop_one variable, and it automatically removed the last column. What are your thoughts about this? I am new to this community, would like to contribute this myself if it is possible addition. Thanks, Trion129 From se.raschka at gmail.com Sun Jun 25 12:06:24 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 25 Jun 2017 12:06:24 -0400 Subject: [scikit-learn] [Feature] drop_one in one hot encoder In-Reply-To: References: Message-ID: <8945128A-81DA-49AC-AFF6-42D143840F87@gmail.com> Hi, hm, I think that dropping a column in onehot encoded features is quite uncommon in machine learning practice -- based on the applications and implementations I've seen. My guess is that the onehot encoded features are multicollinear anyway!? There may be certain algorithms that benefit from dropping a column, though (e.g., linear regression as a simple example). For instance, pandas' get_dummies has a "drop_first" parameter ... I think it would make sense to have such a parameter in the onehotencoder as well, e.g., for working with pipelines. Best, Sebastian > On Jun 25, 2017, at 7:48 AM, Parminder Singh wrote: > > Hy Sci-kittens! :-) > > I was doing machine learning a-z course on Udemy, there they told that every time one-hot encoding is done, one of the columns should be dropped as it is like doubling same category twice and redundant to model. I thought if instead of having user find the index and drop it after preprocessing, OneHotEncoder had a drop_one variable, and it automatically removed the last column. What are your thoughts about this? I am new to this community, would like to contribute this myself if it is possible addition.
> > Thanks, > Trion129 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Sun Jun 25 13:01:10 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 25 Jun 2017 19:01:10 +0200 Subject: [scikit-learn] [Feature] drop_one in one hot encoder In-Reply-To: References: Message-ID: <20170625170110.GB2210289@phare.normalesup.org> On Sun, Jun 25, 2017 at 05:18:09PM +0530, Parminder Singh wrote: > Hy Sci-kittens! :-) Nice :). FYI: there is work in progress to replace the OneHotEncoder, as it has many strong limitations: https://github.com/scikit-learn/scikit-learn/pull/9151 It might be useful to have a look at this PR to make sure that it solves the various use cases. Gaël > I was doing machine learning a-z course on Udemy, there they told that > every time one-hot encoding is done, one of the columns should be dropped > as it is like doubling same category twice and redundant to model. I > thought if instead of having user find the index and drop it after > preprocessing, OneHotEncoder had a drop_one variable, and it automatically > removed the last column. What are your thoughts about this? I am new to > this community, would like to contribute this myself if it is possible > addition.
> Thanks, > Trion129 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From mcasl at unileon.es Mon Jun 26 02:43:32 2017 From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=) Date: Mon, 26 Jun 2017 08:43:32 +0200 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: <211ACCD2-E744-48FA-AFDA-D0C1AC4BA5E8@esbet.es> References: <211ACCD2-E744-48FA-AFDA-D0C1AC4BA5E8@esbet.es> Message-ID: Yes, I guess most users will be happy without using weights. Some will need to use one single vector, but I am currently researching a weighting method thus my need of evaluating multiple weight vectors. I understand that it seems to be a very specific issue with a simple workaround, most likely not worthy of any programming effort yet as there are more important issues to address. I guess that adding a note on this behaviour on the documentation could be great. Knowing which parameters can be iterated over and which are not supported gives the user base more solid ground. I'm committed to spending a few hours studying the code. Should I be successful I will come again with a pull request. I'll cross my fingers :-) Best Manolo El 24 jun. 2017 20:05, "Julio Antonio Soto de Vicente" escribió: Joel is right. In fact, you usually don't want to tune the sample weights much: you may leave them default, set them in order to balance classes, or fix them according to some business rule. That said, you can always run a couple of grid searches changing the sample weights and compare results afterwards.
-- Julio El 24 jun 2017, a las 15:51, Joel Nothman escribió: yes, trying multiple sample weightings is not supported by grid search directly. On 23 Jun 2017 6:36 pm, "Manuel Castejón Limas" wrote: > Dear Joel, > > I tried and removed the square brackets and now it works as expected *for > a single* sample_weight vector: > > validator = GridSearchCV(my_Regressor, > param_grid={'number_of_hidden_neurons': range(4, 5), > 'epochs': [50], > }, > fit_params={'sample_weight': my_sample_weights }, > n_jobs=1, > ) > validator.fit(x, y) > > The problem now is that I want to try multiple trainings with multiple > sample_weight parameters, in the following fashion: > > validator = GridSearchCV(my_Regressor, > param_grid={'number_of_hidden_neurons': range(4, 5), > 'epochs': [50], > 'sample_weight': [my_sample_weights, my_sample_weights**2] , > }, > fit_params={}, > n_jobs=1, > ) > validator.fit(x, y) > > But unfortunately it produces the same error again: > > ValueError: Found a sample_weight array with shape (1000,) for an input > with shape (666, 1). sample_weight cannot be broadcast. > > I guess that the issue is that the sample_weight parameter was not > thought to be changed during the tuning, was it? > > > Thank you all for your patience and support. > Best > Manolo > > > > > 2017-06-23 1:17 GMT+02:00 Manuel CASTEJÓN LIMAS : > >> Dear Joel, >> I'm just passing an iterable as I would do with any other sequence of >> parameters to tune. In this case the list only has one element to use but >> in general I ought to be able to pass a collection of vectors. >> Anyway, I guess that that issue is not the cause of the problem. >> >> El 23 jun. 2017 1:04 a. m., "Joel Nothman" >> escribió: >> >>> why are you passing [my_sample_weights] rather than just >>> my_sample_weights?
>>> >>> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jun 26 03:17:02 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 26 Jun 2017 17:17:02 +1000 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: <211ACCD2-E744-48FA-AFDA-D0C1AC4BA5E8@esbet.es> Message-ID: I don't think we'll be accepting a pull request adding this feature to scikit-learn. It is too niche. But you should go ahead and modify the search to operate over weightings for your own research. If you feel the documentation can be clarified, a pull request there is welcome. On 26 June 2017 at 16:43, Manuel CASTEJÓN LIMAS wrote: > Yes, I guess most users will be happy without using weights. Some will > need to use one single vector, but I am currently researching a weighting > method thus my need of evaluating multiple weight vectors. > > I understand that it seems to be a very specific issue with a simple > workaround, most likely not worthy of any programming effort yet as there > are more important issues to address. > > I guess that adding a note on this behaviour on the documentation could be > great. Knowing which parameters can be iterated over and which are not > supported gives the user base more solid ground. > > I'm committed to spending a few hours studying the code. Should I be > successful I will come again with a pull request.
> I'll cross my fingers :-) > Best > Manolo > > > > On 24 Jun 2017 at 20:05, "Julio Antonio Soto de Vicente" > wrote: > > Joel is right. > > In fact, you usually don't want to tune the sample weights much: you may > leave them default, set them in order to balance classes, or fix them > according to some business rule. > > That said, you can always run a couple of grid searches changing the > sample weights and compare results afterwards. > > -- > Julio > > On 24 Jun 2017, at 15:51, Joel Nothman > wrote: > > yes, trying multiple sample weightings is not supported by grid search > directly. > > [...]
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed... URL:
From nelle.varoquaux at gmail.com Mon Jun 26 12:28:13 2017 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Mon, 26 Jun 2017 09:28:13 -0700 Subject: [scikit-learn] Fwd: [SciPy-User] EuroSciPy 2017 call for contributions - extension of deadline In-Reply-To: <20170626104953.GA16487@pi-x230> References: <20170626104953.GA16487@pi-x230> Message-ID: Hi everyone, I thought some of you might be interested in this deadline extension.
Cheers, N ---------- Forwarded message ---------- From: Pierre de Buyl Date: 26 June 2017 at 03:49 Subject: [SciPy-User] EuroSciPy 2017 call for contributions - extension of deadline To: scipy-user at python.org, numpy-discussion at python.org (Apologies if you receive multiple copies of this message) 10th European Conference on Python in Science August 28 - September 1, 2017 in Erlangen, Germany The Call for Papers is extended to July 02, 2017 23:00 CEST Description: The EuroSciPy meeting is a cross-disciplinary gathering focused on the use and development of the Python language in scientific research. This event strives to bring together both users and developers of scientific tools, as well as academic research and state of the art industry. Erlangen is one of Germany's major science hubs and located north of Munich (90 minutes by train). The Call for Papers is extended to July 02, 2017 23:00 CEST Regards, The EuroSciPy team https://www.euroscipy.org/2017/ _______________________________________________ SciPy-User mailing list SciPy-User at python.org https://mail.python.org/mailman/listinfo/scipy-user From t3kcit at gmail.com Tue Jun 27 13:43:44 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 27 Jun 2017 13:43:44 -0400 Subject: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV In-Reply-To: References: <211ACCD2-E744-48FA-AFDA-D0C1AC4BA5E8@esbet.es> Message-ID: <254a293f-03fc-ffb5-6da6-071a7aeada0a@gmail.com> We could clarify in the documentation that you can grid-search any (hyper) parameter of a model, but not parameters to fit? Only the values returned by get_params() can be tuned. Only "param_grid" will be searched, not "fit_params". "fit_params" can contain only a single setting. On 06/26/2017 03:17 AM, Joel Nothman wrote: > I don't think we'll be accepting a pull request adding this feature to > scikit-learn. It is too niche. 
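The rule stated above, that only an estimator's constructor parameters (the keys returned by get_params()) can appear in param_grid while fit arguments such as sample_weight cannot, can be checked directly. A small sketch, with SGDClassifier as an arbitrary example estimator:

```python
# param_grid may only name keys of get_params(); fit-time arguments
# such as sample_weight never appear there.
from sklearn.linear_model import SGDClassifier

params = SGDClassifier().get_params()
print(sorted(params))             # constructor parameters: 'alpha', 'loss', ...
print("alpha" in params)          # True: tunable via param_grid
print("sample_weight" in params)  # False: a fit() argument, not a hyperparameter
```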
But you should go ahead and modify the > search to operate over weightings for your own research. If you feel > the documentation can be clarified, a pull request there is welcome. > > On 26 June 2017 at 16:43, Manuel CASTEJÓN LIMAS > wrote: > > [...]
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed... URL:
From olivier.grisel at ensta.org Wed Jun 28 03:42:36 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 28 Jun 2017 09:42:36 +0200 Subject: [scikit-learn] Scikit-learn workshop and sprint at EuroScipy 2017 in Erlangen In-Reply-To: References: Message-ID: Hi Tim, Thanks for the help. I was planning to do a quick sklearn intro based on slides such as the first part of: https://speakerdeck.com/ogrisel/intro-to-scikit-learn-and-whats-new-in-0-dot-17 (but I would like to re-do them in HTML with remark.js as I do here: https://github.com/ogrisel/decks/tree/gh-pages) and then run through a notebook such as: https://github.com/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb Do you have any suggestions? The workshop duration is 90 min. -- Olivier
From b.noushin7 at gmail.com Thu Jun 29 17:26:37 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 29 Jun 2017 17:26:37 -0400 Subject: [scikit-learn] Agglomerative clustering Message-ID: I have some data and also the pairwise distance matrix of these data points.
I want to cluster them using agglomerative clustering. I read that in sklearn we can pass 'precomputed' as the affinity, and I expect it then takes the distance matrix. But I could not find any example which uses a precomputed affinity with a custom distance matrix. Any help will be highly appreciated. Best, -Noushin
-------------- next part --------------
An HTML attachment was scrubbed... URL:
From ruchika.work at gmail.com Fri Jun 30 09:06:09 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Fri, 30 Jun 2017 09:06:09 -0400 Subject: [scikit-learn] Machine learning for PU data Message-ID: Hi All, I am a scikit-learn user and have a question for the community: has anyone applied any of the available machine learning algorithms in the scikit-learn package to data with positive and unlabeled classes only? If so, would you share some insight with me? I understand this could be a broader topic, but I am new to analyzing PU data and hence can use some help. Thanks, Ruchika
-------------- next part --------------
An HTML attachment was scrubbed... URL:
From rth.yurchak at gmail.com Fri Jun 30 09:39:47 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 30 Jun 2017 16:39:47 +0300 Subject: [scikit-learn] Machine learning for PU data In-Reply-To: References: Message-ID: Hello Ruchika, I don't think that scikit-learn currently has algorithms that can train with positive and unlabeled class labels only. However, you could try one of the following compatible wrappers: - http://nktmemo.github.io/jekyll/update/2015/11/07/pu_classification.html - https://github.com/scikit-learn/scikit-learn/pull/371 (haven't tried them myself).
Also, you could try a one-class SVM as suggested here: https://stackoverflow.com/questions/25700724/binary-semi-supervised-classification-with-positive-only-and-unlabeled-data-set -- Roman On 30/06/17 16:06, Ruchika Nayyar wrote: > Hi All, > > I am a scikit-learn user and have a question for the community, if > anyone has applied any available machine learning algorithms in the > scikit-learn package for data with positive and unlabeled class only? If > so would you share some insight with me. I understand this could be a > broader topic but I am new to analyzing PU data and hence can use some > help. > > Thanks, > Ruchika > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn >
From s.atasever at gmail.com Fri Jun 30 10:14:36 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Fri, 30 Jun 2017 17:14:36 +0300 Subject: [scikit-learn] Construct the microclusters using a CF-Tree Message-ID: Hi all, I want to ask you about clustering using the Birch clustering algorithm. I have an n*n *distance matrix* M where M_ij is the distance between object_i and object_j. (You can see the file format in the attachment.) I want to cluster the data using the Birch clustering algorithm. Does this method have a 'precomputed' option? I need to train an SVM on the centroids of the microclusters, so *how can I get the centroids of the microclusters?* Any help would be highly appreciated.

*Birch code:*

from sklearn.cluster import Birch
import numpy as np

X = np.loadtxt(r"C:\dm.txt", delimiter="\t")
brc = Birch(branching_factor=50, n_clusters=3, threshold=0.5,
            compute_labels=True, copy=True)
brc.fit(X)
print(brc.predict(X))

-------------- next part --------------
An HTML attachment was scrubbed...
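A sketch of the pipeline asked about above: fit Birch, take the micro-cluster (sub-cluster) centroids from the fitted estimator's subcluster_centers_ attribute, and train an SVM on them, using subcluster_labels_ (the global cluster label assigned to each centroid) as the target. The random matrix stands in for the dm.txt features; the Birch settings mirror the question.

```python
# Fit Birch, then train an SVM on the sub-cluster centroids.
import numpy as np
from sklearn.cluster import Birch
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                     # stand-in for the dm.txt features

brc = Birch(branching_factor=50, n_clusters=3, threshold=0.5)
brc.fit(X)

centroids = brc.subcluster_centers_      # leaf sub-cluster centroids
labels = brc.subcluster_labels_          # global cluster label per centroid

svm = SVC(kernel="rbf").fit(centroids, labels)
print(centroids.shape, np.unique(labels))
```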
URL:
-------------- next part --------------
[tab-separated pairwise distance matrix attachment omitted]
From rth.yurchak at gmail.com Fri Jun 30 10:42:11 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 30 Jun 2017 17:42:11 +0300 Subject: [scikit-learn] Construct the microclusters using a CF-Tree In-Reply-To: References: Message-ID: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Hello Sema, On 30/06/17 17:14, Sema Atasever wrote: > I want to cluster them using the Birch clustering algorithm. > Does this method have a 'precomputed' option? No, it doesn't; see http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html so you would need to provide it with the original feature matrix (not the precomputed distance matrix). Since your dataset is fairly small, there is no reason to precompute it anyway. > I need to train an SVM on the centroids of the microclusters, so > how can I get the centroids of the microclusters? By "microclusters" do you mean sub-clusters?
If you are interested in the leaf subclusters, see the Birch.subcluster_centers_ attribute. Otherwise, if you want all the centroids in the hierarchy of subclusters, you can browse the hierarchical tree via the Birch.root_ attribute and then look at _CFSubcluster.centroid_ for each subcluster. Hope this helps, -- Roman
From olivier.grisel at ensta.org Fri Jun 30 11:54:53 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 30 Jun 2017 17:54:53 +0200 Subject: [scikit-learn] Agglomerative clustering In-Reply-To: References: Message-ID: You can have a look at the test named "test_agglomerative_clustering" in: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/tests/test_hierarchical.py -- Olivier
From jbbrown at kuhp.kyoto-u.ac.jp Fri Jun 30 12:07:27 2017 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Sat, 1 Jul 2017 01:07:27 +0900 Subject: [scikit-learn] Fwd: [SciPy-User] EuroSciPy 2017 call for contributions - extension of deadline In-Reply-To: References: <20170626104953.GA16487@pi-x230> Message-ID: Dear Communities, Would it be of interest to the audience to hear a discussion on the state of the art in computational drug discovery model development, which my team and I have done by building on top of scikit-learn and Matplotlib? Everyday-language description of the work and concept: https://www.eurekalert.org/pub_releases/2017-03/ku-cas032917.php While I am based in Kyoto, I will be in Bonn for all of August and September, so I would be thrilled to meet the communities and exchange ideas. With kindest regards, J.B. Brown Leader, Life Science Informatics Research Unit Kyoto University Graduate School of Medicine 2017-06-27 1:28 GMT+09:00 Nelle Varoquaux : > Hi everyone, > > I thought some of you might be interested in this deadline extension.
>
> Cheers,
> N
>
> ---------- Forwarded message ----------
> From: Pierre de Buyl
> Date: 26 June 2017 at 03:49
> Subject: [SciPy-User] EuroSciPy 2017 call for contributions - extension of deadline
> To: scipy-user at python.org, numpy-discussion at python.org
>
> (Apologies if you receive multiple copies of this message)
>
> 10th European Conference on Python in Science
>
> August 28 - September 1, 2017 in Erlangen, Germany
>
> The Call for Papers is extended to July 02, 2017, 23:00 CEST
>
> Description:
>
> The EuroSciPy meeting is a cross-disciplinary gathering focused on the
> use and development of the Python language in scientific research.
> This event strives to bring together both users and developers of
> scientific tools, as well as academic research and state-of-the-art
> industry.
>
> Erlangen is one of Germany's major science hubs, located north of
> Munich (90 minutes by train).
>
> Regards,
>
> The EuroSciPy team
> https://www.euroscipy.org/2017/

From francois.dion at gmail.com  Fri Jun 30 12:47:35 2017
From: francois.dion at gmail.com (Francois Dion)
Date: Fri, 30 Jun 2017 12:47:35 -0400
Subject: [scikit-learn] Scikit-learn at Data Intelligence this past weekend
Message-ID:

This past weekend was the NumFOCUS-sponsored Data Intelligence
conference at Capital One in McLean, Virginia (close to Washington DC,
for those not familiar with US geography).
A few presentations mentioned or used scikit-learn, including Ben
Bengfort's "Visual Pipelines"
(http://data-intelligence.ai/presentations/13), Zachary Beaver's
"Airflow + Scikit-Learn" (http://data-intelligence.ai/presentations/19)
and Pramit Choudary's "Learning to Learn Model Behavior"
(http://data-intelligence.ai/presentations/22), to name a few.

I presented "Seeking Exotics" on Sunday
(http://data-intelligence.ai/presentations/21), on anomalous and
erroneous data, and how statistics, visualizations and scikit-learn can
help (covering PCA, TruncatedSVD, t-SNE, EllipticEnvelope, one-class
classifiers, and the scikit-learn-related imbalanced-learn and sk-sos
packages).

One of the slides I had up resonated quite a bit with the audience,
both in person and on social media:
https://twitter.com/tnfilipiak/status/878999245076008960

The notebooks are on GitHub: https://github.com/fdion/seeking_exotics

Francois
--
@f_dion - https://about.me/francois.dion -
https://www.linkedin.com/in/francois-dion-8b639b79/

From b.noushin7 at gmail.com  Fri Jun 30 12:53:02 2017
From: b.noushin7 at gmail.com (Ariani A)
Date: Fri, 30 Jun 2017 12:53:02 -0400
Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters
Message-ID:

I want to perform agglomerative clustering, but I have no idea of the
number of clusters beforehand, and I want every cluster to have at
least 40 data points in it. How can I apply this with scikit-learn's
AgglomerativeClustering? Should I build a dendrogram and cut it
somehow? I have no idea how to relate the dendrogram to this
constraint, or where to cut it. Any help will be appreciated!
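One possible approach (a hedged sketch on synthetic two-blob data, not an official scikit-learn recipe; the 40-point minimum comes from the question, everything else is made up): compute the linkage with SciPy and raise the dendrogram cut height until every flat cluster reaches the minimum size.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: two well-separated 2-D Gaussian blobs of 100 points each.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

Z = linkage(X, method='ward')
min_size = 40

# Walk the merge heights from small to large; keep the first cut where
# every flat cluster contains at least `min_size` points.
for height in np.sort(Z[:, 2]):
    labels = fcluster(Z, t=height, criterion='distance')
    sizes = np.bincount(labels)[1:]  # fcluster labels are 1-based
    if sizes.min() >= min_size:
        break

print(len(sizes))  # number of clusters in the accepted cut
```

This brute-force scan always terminates (the topmost cut is a single cluster containing all points); for large datasets a binary search over the merge heights would be cheaper.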
From olivier.grisel at ensta.org  Fri Jun 30 16:37:01 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 30 Jun 2017 22:37:01 +0200
Subject: [scikit-learn] Scikit-learn at Data Intelligence this past weekend
In-Reply-To:
References:
Message-ID:

Thanks for this report!

--
Olivier

From olivier.grisel at ensta.org  Fri Jun 30 16:39:27 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 30 Jun 2017 22:39:27 +0200
Subject: [scikit-learn] Fwd: [SciPy-User] EuroSciPy 2017 call for contributions - extension of deadline
In-Reply-To:
References: <20170626104953.GA16487@pi-x230>
Message-ID:

I am pretty sure this is exactly the kind of presentation that the
EuroSciPy audience would enjoy. Please submit!

--
Olivier

From jmschreiber91 at gmail.com  Fri Jun 30 18:31:37 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Fri, 30 Jun 2017 15:31:37 -0700
Subject: [scikit-learn] Scikit-learn at Data Intelligence this past weekend
In-Reply-To:
References:
Message-ID:

Thanks for the summary. I was there as well, and scikit-learn had a
strong showing: many talks that weren't directly about scikit-learn
still mentioned it or used its models during the presentation.

On Fri, Jun 30, 2017 at 9:47 AM, Francois Dion wrote:
> This past weekend was the NumFOCUS-sponsored Data Intelligence
> conference at Capital One in McLean, Virginia (close to Washington DC,
> for those not familiar with US geography).
>
> A few presentations mentioned or used scikit-learn, including Ben
> Bengfort's "Visual Pipelines"
> (http://data-intelligence.ai/presentations/13), Zachary Beaver's
> "Airflow + Scikit-Learn"
> (http://data-intelligence.ai/presentations/19) and Pramit Choudary's
> "Learning to Learn Model Behavior"
> (http://data-intelligence.ai/presentations/22), to name a few.
>
> I presented "Seeking Exotics" on Sunday
> (http://data-intelligence.ai/presentations/21), on anomalous and
> erroneous data, and how statistics, visualizations and scikit-learn
> can help (covering PCA, TruncatedSVD, t-SNE, EllipticEnvelope,
> one-class classifiers, and the scikit-learn-related imbalanced-learn
> and sk-sos packages).
>
> One of the slides I had up resonated quite a bit with the audience,
> both in person and on social media:
>
> https://twitter.com/tnfilipiak/status/878999245076008960
>
> The notebooks are on GitHub: https://github.com/fdion/seeking_exotics
>
> Francois
> --
> @f_dion - https://about.me/francois.dion -
> https://www.linkedin.com/in/francois-dion-8b639b79/

From gael.varoquaux at normalesup.org  Fri Jun 30 19:54:10 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 1 Jul 2017 01:54:10 +0200
Subject: [scikit-learn] Scikit-learn at Data Intelligence this past weekend
In-Reply-To:
References:
Message-ID: <20170630235410.GL4115401@phare.normalesup.org>

Fantastic! Thanks a lot for the summary.

Gaël

On Fri, Jun 30, 2017 at 12:47:35PM -0400, Francois Dion wrote:
> This past weekend was the NumFOCUS-sponsored Data Intelligence
> conference at Capital One in McLean, Virginia (close to Washington DC,
> for those not familiar with US geography).
> A few presentations mentioned or used scikit-learn, including Ben
> Bengfort's "Visual Pipelines"
> (http://data-intelligence.ai/presentations/13), Zachary Beaver's
> "Airflow + Scikit-Learn"
> (http://data-intelligence.ai/presentations/19) and Pramit Choudary's
> "Learning to Learn Model Behavior"
> (http://data-intelligence.ai/presentations/22), to name a few.
> I presented "Seeking Exotics" on Sunday
> (http://data-intelligence.ai/presentations/21), on anomalous and
> erroneous data, and how statistics, visualizations and scikit-learn
> can help (covering PCA, TruncatedSVD, t-SNE, EllipticEnvelope,
> one-class classifiers, and the scikit-learn-related imbalanced-learn
> and sk-sos packages).
> One of the slides I had up resonated quite a bit with the audience,
> both in person and on social media:
> https://twitter.com/tnfilipiak/status/878999245076008960
> The notebooks are on GitHub: https://github.com/fdion/seeking_exotics
> Francois
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette, France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux