From jmschreiber91 at gmail.com Mon Apr 3 01:47:36 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sun, 2 Apr 2017 22:47:36 -0700 Subject: [scikit-learn] GSoC 2017 In-Reply-To: References: Message-ID: Less than 11 hours left in the application period! If you've asked for feedback and we haven't gotten back to you, make sure you submit anyway. If you don't get your submission in before the deadline (April 3rd, 9:00am PST) we won't be able to consider you. On Tue, Mar 21, 2017 at 3:27 PM, Jacob Schreiber wrote: > Starting yesterday, students were able to submit their proposals on the > GSoC website. Please review this site > > thoroughly before making a submission. We're eager to hear what prospective > students have in mind for a contribution to sklearn. > > As we've said before, mentor time is at a premium this year. If you've > posted a proposal and we haven't responded, please keep poking us. I know > that personally I tend to wake up to between 30-70 emails and have to > triage based on my availability, and that Gael likely scoffs at this small > number. Things fall through the cracks. If you haven't heard back that > doesn't mean we don't want your submission, please submit or ask for > feedback! > > A strong factor in determining if you're going to be chosen will be your > availability with the code and methods you'd like to work on. It is less > likely that we will take someone unfamiliar with the code base this year, > as there is a large starting cost to getting familiar with an intricate > code-base. In your application please emphasize your prior experience with > either sklearn code, cython code (if applicable for your project) or > machine learning code in general. > > Let us know if you have any other questions. > > Jacob > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Apr 3 01:50:42 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sun, 2 Apr 2017 22:50:42 -0700 Subject: [scikit-learn] GSoC 2017 In-Reply-To: References: Message-ID: Make sure that you tag your proposal with 'scikit-learn' when you submit it so that we can identify them easily. On Sun, Apr 2, 2017 at 10:47 PM, Jacob Schreiber wrote: > Less than 11 hours left in the application period! If you've asked for > feedback and we haven't gotten back to you, make sure you submit anyway. If > you don't get your submission in before the deadline (April 3rd, 9:00am > PST) we won't be able to consider you. > > On Tue, Mar 21, 2017 at 3:27 PM, Jacob Schreiber > wrote: > >> Starting yesterday, students were able to submit their proposals on the >> GSoC website. Please review this site >> >> thoroughly before making a submission. We're eager to hear what prospective >> students have in mind for a contribution to sklearn. >> >> As we've said before, mentor time is at a premium this year. If you've >> posted a proposal and we haven't responded, please keep poking us. I know >> that personally I tend to wake up to between 30-70 emails and have to >> triage based on my availability, and that Gael likely scoffs at this small >> number. Things fall through the cracks. If you haven't heard back that >> doesn't mean we don't want your submission, please submit or ask for >> feedback! >> >> A strong factor in determining if you're going to be chosen will be your >> availability with the code and methods you'd like to work on. 
It is less
>> likely that we will take someone unfamiliar with the code base this year,
>> as there is a large starting cost to getting familiar with an intricate
>> code-base. In your application please emphasize your prior experience with
>> either sklearn code, cython code (if applicable for your project) or
>> machine learning code in general.
>>
>> Let us know if you have any other questions.
>>
>> Jacob
>>

From yizhengz at andrew.cmu.edu  Mon Apr  3 03:13:34 2017
From: yizhengz at andrew.cmu.edu (Yizheng Zhao)
Date: Mon, 3 Apr 2017 00:13:34 -0700
Subject: [scikit-learn] GSoC 2017 Proposal: Improve online learning for
 linear models
Message-ID:

Hi developers,

I am excited to have the opportunity to work with you!

I am Yizheng Zhao, a graduate student at Carnegie Mellon University
majoring in Software Engineering, and I received my Bachelor's degree in
Math from Jilin University in 2016. I love Python and machine learning,
and that is why I want to make my own contribution to the community. I
have 2 years of experience developing with Python and I am quite familiar
with scikit-learn as a user. In college, I learned several machine
learning algorithms and their mathematical derivations. I believe my
strong math background and coding skills will serve me well here.

Here is my proposal:
https://github.com/YizhengZHAO/scikit-learn/wiki/GSoC-2017-:-Improve-online-learning-for-linear-models

By the way, could you please explain in more detail what "A tool to set
the learning rate on a few epochs" refers to? I am happy to get
suggestions from the community.

Sincerely,
Yizheng

From shuchi.23 at gmail.com  Mon Apr  3 07:38:33 2017
From: shuchi.23 at gmail.com (Shuchi Mala)
Date: Mon, 3 Apr 2017 17:08:33 +0530
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To: <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
 <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
Message-ID:

How can I get ground truth labels of the training examples in my dataset?

With Best Regards,
Shuchi Mala
Research Scholar
Department of Civil Engineering
MNIT Jaipur

On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Hi, Shuchi,
>
> regarding labels_true: you'd only be able to compute the rand index
> adjusted for chance if you have the ground truth labels of the training
> examples in your dataset.
>
> The second parameter, labels_pred, takes in the predicted cluster labels
> (indices) that you got from the clustering. E.g.,
>
> dbscn = DBSCAN()
> labels_pred = dbscn.fit_predict(X)
>
> Best,
> Sebastian
>
> > On Mar 31, 2017, at 12:02 AM, Shuchi Mala <shuchi.23 at gmail.com> wrote:
> >
> > Thank you so much for your quick reply. I have one more doubt. The below
> statement is used to calculate rand score.
> >
> > metrics.adjusted_rand_score(labels_true, labels_pred)
> > In my case what will be labels_true and labels_pred and how I will
> calculate labels_pred?
> >
> > With Best Regards,
> > Shuchi Mala
> > Research Scholar
> > Department of Civil Engineering
> > MNIT Jaipur
> >
> >
> > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby <
> shane.grigsby at colorado.edu> wrote:
> > Since you're using lat / long coords, you'll also want to convert them
> to radians and specify 'haversine' as your distance metric; i.e. :
> >
> > coords = np.vstack([lats.ravel(),longs.ravel()]).T
> > coords *= np.pi / 180.  # to radians
> >
> > ...and:
> >
> > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine')
> > # replace eps and min_samples as appropriate
> > db.fit(coords)
> >
> > Cheers,
> > Shane
> >
> >
> > On 03/30, Sebastian Raschka wrote:
> > Hi, Shuchi,
> >
> > 1. How can I add data to the data set of the package?
> >
> > You don't need to add your dataset to the dataset module to run your
> analysis. A convenient way to load it into a numpy array would be via
> pandas. E.g.,
> >
> > import pandas as pd
> > df = pd.read_csv('your_data.txt', delimiter=r"\s+")
> > X = df.values
> >
> > 2. How I can calculate Rand index for my data?
> >
> > After you ran the clustering, you can use the 'adjusted_rand_score'
> function, e.g., see
> > http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score
> >
> > 3. How to use make_blobs command for my data?
> >
> > The make_blobs command is just a utility function to create toy
> datasets; you wouldn't need it in your case since you already have
> 'real' data.
> >
> > Best,
> > Sebastian
> >
> >
> > On Mar 30, 2017, at 4:51 AM, Shuchi Mala <shuchi.23 at gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I have the data with following attributes: (Latitude, Longitude). Now I
> am performing clustering using DBSCAN for my data. I have following doubts:
> >
> > 1. How can I add data to the data set of the package?
> > 2. How I can calculate Rand index for my data?
> > 3. How to use make_blobs command for my data?
> >
> > Sample of my data is :
> > Latitude Longitude
> > 37.76901 -122.429299
> > 37.76904 -122.42913
> > 37.76878 -122.429092
> > 37.7763 -122.424249
> > 37.77627 -122.424657
> >
> >
> > With Best Regards,
> > Shuchi Mala
> > Research Scholar
> > Department of Civil Engineering
> > MNIT Jaipur
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From se.raschka at gmail.com  Mon Apr  3 10:35:08 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Apr 2017 10:35:08 -0400
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To:
References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
 <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
Message-ID:

Don't get me wrong, but you'd have to either label them manually yourself,
ask domain experts, or use platforms like Amazon Mechanical Turk (or
collect them in some other way).

> On Apr 3, 2017, at 7:38 AM, Shuchi Mala <shuchi.23 at gmail.com> wrote:
>
> How can I get ground truth labels of the training examples in my dataset?
> > With Best Regards, > Shuchi Mala > Research Scholar > Department of Civil Engineering > MNIT Jaipur > > > On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka wrote: > Hi, Shuchi, > > regarding labels_true: you?d only be able to compute the rand index adjusted for chance if you have the ground truth labels iof the training examples in your dataset. > > The second parameter, labels_pred, takes in the predicted cluster labels (indices) that you got from the clustering. E.g, > > dbscn = DBSCAN() > labels_pred = dbscn.fit(X).predict(X) > > Best, > Sebastian > > > > On Mar 31, 2017, at 12:02 AM, Shuchi Mala wrote: > > > > Thank you so much for your quick reply. I have one more doubt. The below statement is used to calculate rand score. > > > > metrics.adjusted_rand_score(labels_true, labels_pred) > > In my case what will be labels_true and labels_pred and how I will calculate labels_pred? > > > > With Best Regards, > > Shuchi Mala > > Research Scholar > > Department of Civil Engineering > > MNIT Jaipur > > > > > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby wrote: > > Since you're using lat / long coords, you'll also want to convert them to radians and specify 'haversine' as your distance metric; i.e. : > > > > coords = np.vstack([lats.ravel(),longs.ravel()]).T > > coords *= np.pi / 180. # to radians > > > > ...and: > > > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') > > # replace eps and min_samples as appropriate > > db.fit(coords) > > > > Cheers, > > Shane > > > > > > On 03/30, Sebastian Raschka wrote: > > Hi, Shuchi, > > > > 1. How can I add data to the data set of the package? > > > > You don?t need to add your dataset to the dataset module to run your analysis. A convenient way to load it into a numpy array would be via pandas. E.g., > > > > import pandas as pd > > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) > > X = df.values > > > > 2. How I can calculate Rand index for my data? > > > > After you ran the clustering, you can use the ?adjusted_rand_score? function, e.g., see > > http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score > > > > 3. How to use make_blobs command for my data? > > > > The make_blobs command is just a utility function to create toydatasets, you wouldn?t need it in your case since you already have ?real? data. > > > > Best, > > Sebastian > > > > > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote: > > > > Hi everyone, > > > > I have the data with following attributes: (Latitude, Longitude). Now I am performing clustering using DBSCAN for my data. I have following doubts: > > > > 1. How can I add data to the data set of the package? > > 2. How I can calculate Rand index for my data? > > 3. How to use make_blobs command for my data? 
> > > > Sample of my data is : > > Latitude Longitude > > 37.76901 -122.429299 > > 37.76904 -122.42913 > > 37.76878 -122.429092 > > 37.7763 -122.424249 > > 37.77627 -122.424657 > > > > > > With Best Regards, > > Shuchi Mala > > Research Scholar > > Department of Civil Engineering > > MNIT Jaipur > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > > *PhD candidate & Research Assistant* > > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > > *University of Colorado at Boulder* > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From msaunders83 at gmail.com Mon Apr 3 14:04:51 2017 From: msaunders83 at gmail.com (mat saunders) Date: Mon, 3 Apr 2017 14:04:51 -0400 Subject: [scikit-learn] Fwd: SSIM with tolerances In-Reply-To: References: Message-ID: Hi, I am using SSIM to compare 2 video streams\sets of images and I find it to be almost too accurate. I would like some fudge factor like other image comparison tools have. I used to do it in an automated test suite but due to file sizes and amounts I turned to scikit. I do quality assurance on a render engine and we just want to make sure the images are meaningfully identical build to build. Currently with SSIM I am seeing things as small as 4 pixels across a 1920x1080 image different. I personally would like to ignore those 4 pixels but still catch meaningful items. Say if 8 pixels near each other were off keep those but if they are 8 pixels randomly through the image ignore them. Does this sound like something logical, say using an adjacency of pixels with a tolerance value for color and number of pixels as arguments? See attached image for example of how little is different in the entire image. It is a GIF zoomed in to the exact spot of 3 different pixels so hopefully it works. Regards, Mathew Saunders -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 3dots-again.gif Type: image/gif Size: 101041 bytes Desc: not available URL: From ross at cgl.ucsf.edu Mon Apr 3 14:39:05 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Mon, 3 Apr 2017 11:39:05 -0700 Subject: [scikit-learn] Fwd: SSIM with tolerances In-Reply-To: References: Message-ID: <55e9238a-7369-d264-7f58-7d6209af5dcb@cgl.ucsf.edu> I wonder naively: if you can make rules, why train something to learn them vs. just implementing them directly? I'm really curious if there's an advantage in logistics or performance (can meaningful extrapolation somehow occur?). 
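If plain rules do turn out to be enough, a threshold on the SSIM map plus
a connected-component size filter might already capture the "8 adjacent
pixels matter, 8 scattered ones don't" idea. A rough sketch of what I
mean (assuming scikit-image and scipy are available; the frame names and
both tolerances are made up):

import numpy as np
from scipy import ndimage
from skimage.measure import compare_ssim

# ref, test: 2-D grayscale frames rendered by the two builds
score, ssim_map = compare_ssim(ref, test, full=True)

bad = ssim_map < 0.9                   # per-pixel tolerance
labels, n = ndimage.label(bad)         # group adjacent bad pixels into blobs
sizes = ndimage.sum(bad, labels, np.arange(1, n + 1))
fail = bool((sizes >= 8).any())        # only blobs of 8+ neighbors count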
I think the answer for machine learning is not to make rules, but to
gather examples based on perceptual experiments, assuming what you are
after is noticeability. In that case, you will likely allow dropouts (I
assume black pixels) more when they are in dark areas, I imagine, which
may not be desirable, or might save the corporation a few pennies. :-)

Those perceptual experiments might be costly, but they would save the
angst of getting the rules right, and I wonder what sort of Quality Index
you might derive beyond pass/fail. The data might be leveraged for other
applications. Or maybe you have existing data that could be used to train?

I've only made general comparisons of images (using color histograms at
the moment for my interactive image associator), but I have the QA
background to appreciate the motivation. I'd love to stay on top of it if
a fellow learner could be of use.

Regards,

Bill

On 4/3/17 11:04 AM, mat saunders wrote:
> Hi,
>
> I am using SSIM to compare 2 video streams / sets of images and I find
> it to be almost too accurate. I would like some fudge factor like
> other image comparison tools have. I used to do it in an automated
> test suite but due to file sizes and amounts I turned to scikit.
>
> I do quality assurance on a render engine and we just want to make
> sure the images are meaningfully identical build to build. Currently
> with SSIM I am seeing things as small as 4 pixels across a 1920x1080
> image different. I personally would like to ignore those 4 pixels but
> still catch meaningful items. Say if 8 pixels near each other were off,
> keep those; but if they are 8 pixels randomly through the image, ignore
> them.
>
> Does this sound like something logical, say using an adjacency of
> pixels with a tolerance value for color and number of pixels as
> arguments?
>
> See attached image for example of how little is different in the
> entire image. It is a GIF zoomed in to the exact spot of 3 different
> pixels so hopefully it works.
>
> Regards,
> Mathew Saunders
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From shuchi.23 at gmail.com  Mon Apr  3 23:45:59 2017
From: shuchi.23 at gmail.com (Shuchi Mala)
Date: Tue, 4 Apr 2017 09:15:59 +0530
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To:
References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
 <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
Message-ID:

Hi Raschka,

I want to know how to use cross-validation when another regression model,
such as Poisson regression, is used in place of linear regression.

Kindly help.

With Best Regards,
Shuchi Mala
Research Scholar
Department of Civil Engineering
MNIT Jaipur

On Mon, Apr 3, 2017 at 8:05 PM, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Don't get me wrong, but you'd have to either label them manually yourself,
> ask domain experts, or use platforms like Amazon Mechanical Turk (or
> collect them in some other way).
>
> > On Apr 3, 2017, at 7:38 AM, Shuchi Mala <shuchi.23 at gmail.com> wrote:
> >
> > How can I get ground truth labels of the training examples in my
> dataset?
> > > > With Best Regards, > > Shuchi Mala > > Research Scholar > > Department of Civil Engineering > > MNIT Jaipur > > > > > > On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka > wrote: > > Hi, Shuchi, > > > > regarding labels_true: you?d only be able to compute the rand index > adjusted for chance if you have the ground truth labels iof the training > examples in your dataset. > > > > The second parameter, labels_pred, takes in the predicted cluster labels > (indices) that you got from the clustering. E.g, > > > > dbscn = DBSCAN() > > labels_pred = dbscn.fit(X).predict(X) > > > > Best, > > Sebastian > > > > > > > On Mar 31, 2017, at 12:02 AM, Shuchi Mala wrote: > > > > > > Thank you so much for your quick reply. I have one more doubt. The > below statement is used to calculate rand score. > > > > > > metrics.adjusted_rand_score(labels_true, labels_pred) > > > In my case what will be labels_true and labels_pred and how I will > calculate labels_pred? > > > > > > With Best Regards, > > > Shuchi Mala > > > Research Scholar > > > Department of Civil Engineering > > > MNIT Jaipur > > > > > > > > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby < > shane.grigsby at colorado.edu> wrote: > > > Since you're using lat / long coords, you'll also want to convert them > to radians and specify 'haversine' as your distance metric; i.e. : > > > > > > coords = np.vstack([lats.ravel(),longs.ravel()]).T > > > coords *= np.pi / 180. # to radians > > > > > > ...and: > > > > > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') > > > # replace eps and min_samples as appropriate > > > db.fit(coords) > > > > > > Cheers, > > > Shane > > > > > > > > > On 03/30, Sebastian Raschka wrote: > > > Hi, Shuchi, > > > > > > 1. How can I add data to the data set of the package? > > > > > > You don?t need to add your dataset to the dataset module to run your > analysis. A convenient way to load it into a numpy array would be via > pandas. E.g., > > > > > > import pandas as pd > > > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) > > > X = df.values > > > > > > 2. How I can calculate Rand index for my data? > > > > > > After you ran the clustering, you can use the ?adjusted_rand_score? > function, e.g., see > > > http://scikit-learn.org/stable/modules/clustering. > html#adjusted-rand-score > > > > > > 3. How to use make_blobs command for my data? > > > > > > The make_blobs command is just a utility function to create > toydatasets, you wouldn?t need it in your case since you already have > ?real? data. > > > > > > Best, > > > Sebastian > > > > > > > > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote: > > > > > > Hi everyone, > > > > > > I have the data with following attributes: (Latitude, Longitude). Now > I am performing clustering using DBSCAN for my data. I have following > doubts: > > > > > > 1. How can I add data to the data set of the package? > > > 2. How I can calculate Rand index for my data? > > > 3. How to use make_blobs command for my data? 
> > >
> > > Sample of my data is :
> > > Latitude Longitude
> > > 37.76901 -122.429299
> > > 37.76904 -122.42913
> > > 37.76878 -122.429092
> > > 37.7763 -122.424249
> > > 37.77627 -122.424657
> > >
> > >
> > > With Best Regards,
> > > Shuchi Mala
> > > Research Scholar
> > > Department of Civil Engineering
> > > MNIT Jaipur
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From shuchi.23 at gmail.com  Tue Apr  4 04:35:55 2017
From: shuchi.23 at gmail.com (Shuchi Mala)
Date: Tue, 4 Apr 2017 14:05:55 +0530
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To:
References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
 <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
Message-ID:

Hi Raschka,

I need urgent help: how can I use the Statsmodels Poisson family
(statsmodels.genmod.families.Poisson) with scikit-learn's cross-validation
utilities (cross_val_score, ShuffleSplit, cross_val_predict)?

With Best Regards,
Shuchi Mala
Research Scholar
Department of Civil Engineering
MNIT Jaipur

On Tue, Apr 4, 2017 at 9:15 AM, Shuchi Mala <shuchi.23 at gmail.com> wrote:

> Hi Raschka,
>
> I want to know how to use cross-validation when another regression model,
> such as Poisson regression, is used in place of linear regression.
>
> Kindly help.
>
> With Best Regards,
> Shuchi Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
>
>
> On Mon, Apr 3, 2017 at 8:05 PM, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
>
>> Don't get me wrong, but you'd have to either label them manually
>> yourself, ask domain experts, or use platforms like Amazon Mechanical
>> Turk (or
>> collect them in some other way).
>> > >> > With Best Regards, >> > Shuchi Mala >> > Research Scholar >> > Department of Civil Engineering >> > MNIT Jaipur >> > >> > >> > On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka < >> se.raschka at gmail.com> wrote: >> > Hi, Shuchi, >> > >> > regarding labels_true: you?d only be able to compute the rand index >> adjusted for chance if you have the ground truth labels iof the training >> examples in your dataset. >> > >> > The second parameter, labels_pred, takes in the predicted cluster >> labels (indices) that you got from the clustering. E.g, >> > >> > dbscn = DBSCAN() >> > labels_pred = dbscn.fit(X).predict(X) >> > >> > Best, >> > Sebastian >> > >> > >> > > On Mar 31, 2017, at 12:02 AM, Shuchi Mala >> wrote: >> > > >> > > Thank you so much for your quick reply. I have one more doubt. The >> below statement is used to calculate rand score. >> > > >> > > metrics.adjusted_rand_score(labels_true, labels_pred) >> > > In my case what will be labels_true and labels_pred and how I will >> calculate labels_pred? >> > > >> > > With Best Regards, >> > > Shuchi Mala >> > > Research Scholar >> > > Department of Civil Engineering >> > > MNIT Jaipur >> > > >> > > >> > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby < >> shane.grigsby at colorado.edu> wrote: >> > > Since you're using lat / long coords, you'll also want to convert >> them to radians and specify 'haversine' as your distance metric; i.e. : >> > > >> > > coords = np.vstack([lats.ravel(),longs.ravel()]).T >> > > coords *= np.pi / 180. # to radians >> > > >> > > ...and: >> > > >> > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') >> > > # replace eps and min_samples as appropriate >> > > db.fit(coords) >> > > >> > > Cheers, >> > > Shane >> > > >> > > >> > > On 03/30, Sebastian Raschka wrote: >> > > Hi, Shuchi, >> > > >> > > 1. How can I add data to the data set of the package? >> > > >> > > You don?t need to add your dataset to the dataset module to run your >> analysis. A convenient way to load it into a numpy array would be via >> pandas. E.g., >> > > >> > > import pandas as pd >> > > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) >> > > X = df.values >> > > >> > > 2. How I can calculate Rand index for my data? >> > > >> > > After you ran the clustering, you can use the ?adjusted_rand_score? >> function, e.g., see >> > > http://scikit-learn.org/stable/modules/clustering.html# >> adjusted-rand-score >> > > >> > > 3. How to use make_blobs command for my data? >> > > >> > > The make_blobs command is just a utility function to create >> toydatasets, you wouldn?t need it in your case since you already have >> ?real? data. >> > > >> > > Best, >> > > Sebastian >> > > >> > > >> > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote: >> > > >> > > Hi everyone, >> > > >> > > I have the data with following attributes: (Latitude, Longitude). Now >> I am performing clustering using DBSCAN for my data. I have following >> doubts: >> > > >> > > 1. How can I add data to the data set of the package? >> > > 2. How I can calculate Rand index for my data? >> > > 3. How to use make_blobs command for my data? 
>> > >
>> > > Sample of my data is :
>> > > Latitude Longitude
>> > > 37.76901 -122.429299
>> > > 37.76904 -122.42913
>> > > 37.76878 -122.429092
>> > > 37.7763 -122.424249
>> > > 37.77627 -122.424657
>> > >
>> > >
>> > > With Best Regards,
>> > > Shuchi Mala
>> > > Research Scholar
>> > > Department of Civil Engineering
>> > > MNIT Jaipur
>> > >
>> > > _______________________________________________
>> > > scikit-learn mailing list
>> > > scikit-learn at python.org
>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From shuchi.23 at gmail.com  Tue Apr  4 23:57:10 2017
From: shuchi.23 at gmail.com (Shuchi Mala)
Date: Wed, 5 Apr 2017 09:27:10 +0530
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To:
References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
 <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
Message-ID:

Hi Raschka,

I need urgent help: how can I use the Statsmodels Poisson family
(statsmodels.genmod.families.Poisson) with scikit-learn's cross-validation
utilities (cross_val_score, ShuffleSplit, cross_val_predict)?

With Best Regards,
Shuchi Mala
Research Scholar
Department of Civil Engineering
MNIT Jaipur

On Tue, Apr 4, 2017 at 2:05 PM, Shuchi Mala <shuchi.23 at gmail.com> wrote:

> Hi Raschka,
>
> I need urgent help: how can I use the Statsmodels Poisson family
> (statsmodels.genmod.families.Poisson) with scikit-learn's
> cross-validation utilities (cross_val_score, ShuffleSplit,
> cross_val_predict)?
>
> With Best Regards,
> Shuchi Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
>
>
> On Tue, Apr 4, 2017 at 9:15 AM, Shuchi Mala <shuchi.23 at gmail.com> wrote:
>
>> Hi Raschka,
>>
>> I want to know how to use cross-validation when another regression
>> model, such as Poisson regression, is used in place of linear regression.
>>
>> Kindly help.
>>
>> With Best Regards,
>> Shuchi Mala
>> Research Scholar
>> Department of Civil Engineering
>> MNIT Jaipur
>>
>>
>> On Mon, Apr 3, 2017 at 8:05 PM, Sebastian Raschka <se.raschka at gmail.com>
>> wrote:
>>
>>> Don't get me wrong, but you'd have to either label them manually
>>> yourself, ask domain experts, or use platforms like Amazon Mechanical
>>> Turk (or
>>> collect them in some other way).
>>> >>> > On Apr 3, 2017, at 7:38 AM, Shuchi Mala wrote: >>> > >>> > How can I get ground truth labels of the training examples in my >>> dataset? >>> > >>> > With Best Regards, >>> > Shuchi Mala >>> > Research Scholar >>> > Department of Civil Engineering >>> > MNIT Jaipur >>> > >>> > >>> > On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka < >>> se.raschka at gmail.com> wrote: >>> > Hi, Shuchi, >>> > >>> > regarding labels_true: you?d only be able to compute the rand index >>> adjusted for chance if you have the ground truth labels iof the training >>> examples in your dataset. >>> > >>> > The second parameter, labels_pred, takes in the predicted cluster >>> labels (indices) that you got from the clustering. E.g, >>> > >>> > dbscn = DBSCAN() >>> > labels_pred = dbscn.fit(X).predict(X) >>> > >>> > Best, >>> > Sebastian >>> > >>> > >>> > > On Mar 31, 2017, at 12:02 AM, Shuchi Mala >>> wrote: >>> > > >>> > > Thank you so much for your quick reply. I have one more doubt. The >>> below statement is used to calculate rand score. >>> > > >>> > > metrics.adjusted_rand_score(labels_true, labels_pred) >>> > > In my case what will be labels_true and labels_pred and how I will >>> calculate labels_pred? >>> > > >>> > > With Best Regards, >>> > > Shuchi Mala >>> > > Research Scholar >>> > > Department of Civil Engineering >>> > > MNIT Jaipur >>> > > >>> > > >>> > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby < >>> shane.grigsby at colorado.edu> wrote: >>> > > Since you're using lat / long coords, you'll also want to convert >>> them to radians and specify 'haversine' as your distance metric; i.e. : >>> > > >>> > > coords = np.vstack([lats.ravel(),longs.ravel()]).T >>> > > coords *= np.pi / 180. # to radians >>> > > >>> > > ...and: >>> > > >>> > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') >>> > > # replace eps and min_samples as appropriate >>> > > db.fit(coords) >>> > > >>> > > Cheers, >>> > > Shane >>> > > >>> > > >>> > > On 03/30, Sebastian Raschka wrote: >>> > > Hi, Shuchi, >>> > > >>> > > 1. How can I add data to the data set of the package? >>> > > >>> > > You don?t need to add your dataset to the dataset module to run your >>> analysis. A convenient way to load it into a numpy array would be via >>> pandas. E.g., >>> > > >>> > > import pandas as pd >>> > > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) >>> > > X = df.values >>> > > >>> > > 2. How I can calculate Rand index for my data? >>> > > >>> > > After you ran the clustering, you can use the ?adjusted_rand_score? >>> function, e.g., see >>> > > http://scikit-learn.org/stable/modules/clustering.html#adjus >>> ted-rand-score >>> > > >>> > > 3. How to use make_blobs command for my data? >>> > > >>> > > The make_blobs command is just a utility function to create >>> toydatasets, you wouldn?t need it in your case since you already have >>> ?real? data. >>> > > >>> > > Best, >>> > > Sebastian >>> > > >>> > > >>> > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala >>> wrote: >>> > > >>> > > Hi everyone, >>> > > >>> > > I have the data with following attributes: (Latitude, Longitude). >>> Now I am performing clustering using DBSCAN for my data. I have following >>> doubts: >>> > > >>> > > 1. How can I add data to the data set of the package? >>> > > 2. How I can calculate Rand index for my data? >>> > > 3. How to use make_blobs command for my data? 
>>> > >
>>> > > Sample of my data is :
>>> > > Latitude Longitude
>>> > > 37.76901 -122.429299
>>> > > 37.76904 -122.42913
>>> > > 37.76878 -122.429092
>>> > > 37.7763 -122.424249
>>> > > 37.77627 -122.424657
>>> > >
>>> > >
>>> > > With Best Regards,
>>> > > Shuchi Mala
>>> > > Research Scholar
>>> > > Department of Civil Engineering
>>> > > MNIT Jaipur
>>> > >
>>> > > _______________________________________________
>>> > > scikit-learn mailing list
>>> > > scikit-learn at python.org
>>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn

From awpaundu at gmail.com  Wed Apr  5 09:36:06 2017
From: awpaundu at gmail.com (Ady Wahyudi Paundu)
Date: Wed, 5 Apr 2017 22:36:06 +0900
Subject: [scikit-learn] Multiple normal scenario for OCSVM
Message-ID:

Good day scikit-learn masters,

I have used scikit-learn's OCSVM module previously with satisfying
results. However, on my current task I have this problem with one-class
analysis:

In my previous cases, I used OCSVM as an anomaly detector, and the normal
class in each case came from a single scenario. Now I want to create one
anomaly detector system with multiple normal scenarios (in this case, 3
different normal scenarios). Let's say I have scenarios A, B and C, and I
want to flag all data that does not come from A, B or C.

What I have tried is combining the training data for A, B and C into one
data set and fitting it with the OCSVM module; a minimal sketch of what I
ran is at the end of this mail. When I tested the resulting model on
several anomaly data sets it worked well. However, when I tested it
against any one of the normal scenarios, it gave a very high rate of
false positives (AUROC: 99%).

So my questions: Is it a bad approach to combine all the different normal
data sets into one training set? Or was I using the OCSVM wrong? (I use
the 'rbf' kernel with nu and gamma set to 0.001.) Or is it a case of the
wrong tool? Another algorithm, perhaps?

I don't know if this is a proper question to ask here, so if it is not
(maybe because this is just a machine learning question in general), just
disregard it.
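A minimal sketch of what I ran (the array names are made up; the real
script loads each scenario's data from disk):

import numpy as np
from sklearn.svm import OneClassSVM

# X_a, X_b, X_c: training data from the three normal scenarios
X_train = np.vstack([X_a, X_b, X_c])

clf = OneClassSVM(kernel='rbf', nu=0.001, gamma=0.001)
clf.fit(X_train)

# predict() returns +1 for normal, -1 for anomaly
print(clf.predict(X_anomaly))   # mostly -1, as expected
print(clf.predict(X_test_a))    # many -1 here, i.e. the false positives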
Thank you in advance Best regards, Ady From shane.grigsby at colorado.edu Wed Apr 5 11:30:30 2017 From: shane.grigsby at colorado.edu (Shane Grigsby) Date: Wed, 5 Apr 2017 09:30:30 -0600 Subject: [scikit-learn] urgent help in scikit-learn In-Reply-To: References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu> <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com> Message-ID: <20170405153030.vhscqloolwmq2mjf@MacBook-Pro-3.local> Hi Shuchi, You probably want to query the Statsmodels community for this; they have a google groups board here: https://groups.google.com/forum/#!forum/pystatsmodels Cheers, Shane On 04/05, Shuchi Mala wrote: >Hi Raschka, > >I need an urgent help. how I can use Statsmodels Poisson function >function (statsmodels.genmod.families.Poisson) with Sci-Kit Learn's cross >validation metrics (cross_val_score, ShuffleSplit, cross_val_predict)? > >With Best Regards, >Shuchi Mala >Research Scholar >Department of Civil Engineering >MNIT Jaipur > > >On Tue, Apr 4, 2017 at 2:05 PM, Shuchi Mala wrote: > >> Hi Raschka, >> >> I need an urgent help. how I can use Statsmodels Poisson function >> function (statsmodels.genmod.families.Poisson) with Sci-Kit Learn's cross >> validation metrics (cross_val_score, ShuffleSplit, cross_val_predict)? >> >> With Best Regards, >> Shuchi Mala >> Research Scholar >> Department of Civil Engineering >> MNIT Jaipur >> >> >> On Tue, Apr 4, 2017 at 9:15 AM, Shuchi Mala wrote: >> >>> Hi Raschka, >>> >>> I want to know how to use cross validation when other regression model >>> such as poisson is used in place of linear? >>> >>> Kindly help. >>> >>> With Best Regards, >>> Shuchi Mala >>> Research Scholar >>> Department of Civil Engineering >>> MNIT Jaipur >>> >>> >>> On Mon, Apr 3, 2017 at 8:05 PM, Sebastian Raschka >>> wrote: >>> >>>> Don?t get me wrong, but you?d have to either manually label them >>>> yourself, asking domain experts, or use platforms like Amazon Turk (or >>>> collect them in some other way). >>>> >>>> > On Apr 3, 2017, at 7:38 AM, Shuchi Mala wrote: >>>> > >>>> > How can I get ground truth labels of the training examples in my >>>> dataset? >>>> > >>>> > With Best Regards, >>>> > Shuchi Mala >>>> > Research Scholar >>>> > Department of Civil Engineering >>>> > MNIT Jaipur >>>> > >>>> > >>>> > On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka < >>>> se.raschka at gmail.com> wrote: >>>> > Hi, Shuchi, >>>> > >>>> > regarding labels_true: you?d only be able to compute the rand index >>>> adjusted for chance if you have the ground truth labels iof the training >>>> examples in your dataset. >>>> > >>>> > The second parameter, labels_pred, takes in the predicted cluster >>>> labels (indices) that you got from the clustering. E.g, >>>> > >>>> > dbscn = DBSCAN() >>>> > labels_pred = dbscn.fit(X).predict(X) >>>> > >>>> > Best, >>>> > Sebastian >>>> > >>>> > >>>> > > On Mar 31, 2017, at 12:02 AM, Shuchi Mala >>>> wrote: >>>> > > >>>> > > Thank you so much for your quick reply. I have one more doubt. The >>>> below statement is used to calculate rand score. >>>> > > >>>> > > metrics.adjusted_rand_score(labels_true, labels_pred) >>>> > > In my case what will be labels_true and labels_pred and how I will >>>> calculate labels_pred? 
>>>> > > >>>> > > With Best Regards, >>>> > > Shuchi Mala >>>> > > Research Scholar >>>> > > Department of Civil Engineering >>>> > > MNIT Jaipur >>>> > > >>>> > > >>>> > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby < >>>> shane.grigsby at colorado.edu> wrote: >>>> > > Since you're using lat / long coords, you'll also want to convert >>>> them to radians and specify 'haversine' as your distance metric; i.e. : >>>> > > >>>> > > coords = np.vstack([lats.ravel(),longs.ravel()]).T >>>> > > coords *= np.pi / 180. # to radians >>>> > > >>>> > > ...and: >>>> > > >>>> > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') >>>> > > # replace eps and min_samples as appropriate >>>> > > db.fit(coords) >>>> > > >>>> > > Cheers, >>>> > > Shane >>>> > > >>>> > > >>>> > > On 03/30, Sebastian Raschka wrote: >>>> > > Hi, Shuchi, >>>> > > >>>> > > 1. How can I add data to the data set of the package? >>>> > > >>>> > > You don?t need to add your dataset to the dataset module to run your >>>> analysis. A convenient way to load it into a numpy array would be via >>>> pandas. E.g., >>>> > > >>>> > > import pandas as pd >>>> > > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) >>>> > > X = df.values >>>> > > >>>> > > 2. How I can calculate Rand index for my data? >>>> > > >>>> > > After you ran the clustering, you can use the ?adjusted_rand_score? >>>> function, e.g., see >>>> > > http://scikit-learn.org/stable/modules/clustering.html#adjus >>>> ted-rand-score >>>> > > >>>> > > 3. How to use make_blobs command for my data? >>>> > > >>>> > > The make_blobs command is just a utility function to create >>>> toydatasets, you wouldn?t need it in your case since you already have >>>> ?real? data. >>>> > > >>>> > > Best, >>>> > > Sebastian >>>> > > >>>> > > >>>> > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala >>>> wrote: >>>> > > >>>> > > Hi everyone, >>>> > > >>>> > > I have the data with following attributes: (Latitude, Longitude). >>>> Now I am performing clustering using DBSCAN for my data. I have following >>>> doubts: >>>> > > >>>> > > 1. How can I add data to the data set of the package? >>>> > > 2. How I can calculate Rand index for my data? >>>> > > 3. How to use make_blobs command for my data? 
>>>> > > >>>> > > Sample of my data is : >>>> > > Latitude Longitude >>>> > > 37.76901 -122.429299 >>>> > > 37.76904 -122.42913 >>>> > > 37.76878 -122.429092 >>>> > > 37.7763 -122.424249 >>>> > > 37.77627 -122.424657 >>>> > > >>>> > > >>>> > > With Best Regards, >>>> > > Shuchi Mala >>>> > > Research Scholar >>>> > > Department of Civil Engineering >>>> > > MNIT Jaipur >>>> > > >>>> > > _______________________________________________ >>>> > > scikit-learn mailing list >>>> > > scikit-learn at python.org >>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > > >>>> > > _______________________________________________ >>>> > > scikit-learn mailing list >>>> > > scikit-learn at python.org >>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > > >>>> > > -- >>>> > > *PhD candidate & Research Assistant* >>>> > > *Cooperative Institute for Research in Environmental Sciences >>>> (CIRES)* >>>> > > *University of Colorado at Boulder* >>>> > > >>>> > > _______________________________________________ >>>> > > scikit-learn mailing list >>>> > > scikit-learn at python.org >>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > > >>>> > > _______________________________________________ >>>> > > scikit-learn mailing list >>>> > > scikit-learn at python.org >>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >> >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From jmschreiber91 at gmail.com Wed Apr 5 17:48:17 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 5 Apr 2017 14:48:17 -0700 Subject: [scikit-learn] urgent help in scikit-learn In-Reply-To: <20170405153030.vhscqloolwmq2mjf@MacBook-Pro-3.local> References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu> <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com> <20170405153030.vhscqloolwmq2mjf@MacBook-Pro-3.local> Message-ID: Also, in general it's not appropriate to repeatedly ping someone on this mailing list for 'urgent help.' On Wed, Apr 5, 2017 at 8:30 AM, Shane Grigsby wrote: > Hi Shuchi, > You probably want to query the Statsmodels community for this; they have a > google groups board here: > > https://groups.google.com/forum/#!forum/pystatsmodels > > Cheers, > Shane > > > On 04/05, Shuchi Mala wrote: > >> Hi Raschka, >> >> I need an urgent help. how I can use Statsmodels Poisson function >> function (statsmodels.genmod.families.Poisson) with Sci-Kit Learn's cross >> validation metrics (cross_val_score, ShuffleSplit, cross_val_predict)? 
>> >> With Best Regards, >> Shuchi Mala >> Research Scholar >> Department of Civil Engineering >> MNIT Jaipur >> >> >> On Tue, Apr 4, 2017 at 2:05 PM, Shuchi Mala wrote: >> >> Hi Raschka, >>> >>> I need an urgent help. how I can use Statsmodels Poisson function >>> function (statsmodels.genmod.families.Poisson) with Sci-Kit Learn's >>> cross >>> validation metrics (cross_val_score, ShuffleSplit, cross_val_predict)? >>> >>> With Best Regards, >>> Shuchi Mala >>> Research Scholar >>> Department of Civil Engineering >>> MNIT Jaipur >>> >>> >>> On Tue, Apr 4, 2017 at 9:15 AM, Shuchi Mala wrote: >>> >>> Hi Raschka, >>>> >>>> I want to know how to use cross validation when other regression model >>>> such as poisson is used in place of linear? >>>> >>>> Kindly help. >>>> >>>> With Best Regards, >>>> Shuchi Mala >>>> Research Scholar >>>> Department of Civil Engineering >>>> MNIT Jaipur >>>> >>>> >>>> On Mon, Apr 3, 2017 at 8:05 PM, Sebastian Raschka >>> > >>>> wrote: >>>> >>>> Don?t get me wrong, but you?d have to either manually label them >>>>> yourself, asking domain experts, or use platforms like Amazon Turk (or >>>>> collect them in some other way). >>>>> >>>>> > On Apr 3, 2017, at 7:38 AM, Shuchi Mala wrote: >>>>> > >>>>> > How can I get ground truth labels of the training examples in my >>>>> dataset? >>>>> > >>>>> > With Best Regards, >>>>> > Shuchi Mala >>>>> > Research Scholar >>>>> > Department of Civil Engineering >>>>> > MNIT Jaipur >>>>> > >>>>> > >>>>> > On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka < >>>>> se.raschka at gmail.com> wrote: >>>>> > Hi, Shuchi, >>>>> > >>>>> > regarding labels_true: you?d only be able to compute the rand index >>>>> adjusted for chance if you have the ground truth labels iof the >>>>> training >>>>> examples in your dataset. >>>>> > >>>>> > The second parameter, labels_pred, takes in the predicted cluster >>>>> labels (indices) that you got from the clustering. E.g, >>>>> > >>>>> > dbscn = DBSCAN() >>>>> > labels_pred = dbscn.fit(X).predict(X) >>>>> > >>>>> > Best, >>>>> > Sebastian >>>>> > >>>>> > >>>>> > > On Mar 31, 2017, at 12:02 AM, Shuchi Mala >>>>> wrote: >>>>> > > >>>>> > > Thank you so much for your quick reply. I have one more doubt. The >>>>> below statement is used to calculate rand score. >>>>> > > >>>>> > > metrics.adjusted_rand_score(labels_true, labels_pred) >>>>> > > In my case what will be labels_true and labels_pred and how I will >>>>> calculate labels_pred? >>>>> > > >>>>> > > With Best Regards, >>>>> > > Shuchi Mala >>>>> > > Research Scholar >>>>> > > Department of Civil Engineering >>>>> > > MNIT Jaipur >>>>> > > >>>>> > > >>>>> > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby < >>>>> shane.grigsby at colorado.edu> wrote: >>>>> > > Since you're using lat / long coords, you'll also want to convert >>>>> them to radians and specify 'haversine' as your distance metric; i.e. : >>>>> > > >>>>> > > coords = np.vstack([lats.ravel(),longs.ravel()]).T >>>>> > > coords *= np.pi / 180. # to radians >>>>> > > >>>>> > > ...and: >>>>> > > >>>>> > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') >>>>> > > # replace eps and min_samples as appropriate >>>>> > > db.fit(coords) >>>>> > > >>>>> > > Cheers, >>>>> > > Shane >>>>> > > >>>>> > > >>>>> > > On 03/30, Sebastian Raschka wrote: >>>>> > > Hi, Shuchi, >>>>> > > >>>>> > > 1. How can I add data to the data set of the package? >>>>> > > >>>>> > > You don?t need to add your dataset to the dataset module to run >>>>> your >>>>> analysis. 
A convenient way to load it into a numpy array would be via >>>>> pandas. E.g., >>>>> > > >>>>> > > import pandas as pd >>>>> > > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) >>>>> > > X = df.values >>>>> > > >>>>> > > 2. How I can calculate Rand index for my data? >>>>> > > >>>>> > > After you ran the clustering, you can use the ?adjusted_rand_score? >>>>> function, e.g., see >>>>> > > http://scikit-learn.org/stable/modules/clustering.html#adjus >>>>> ted-rand-score >>>>> > > >>>>> > > 3. How to use make_blobs command for my data? >>>>> > > >>>>> > > The make_blobs command is just a utility function to create >>>>> toydatasets, you wouldn?t need it in your case since you already have >>>>> ?real? data. >>>>> > > >>>>> > > Best, >>>>> > > Sebastian >>>>> > > >>>>> > > >>>>> > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala >>>>> wrote: >>>>> > > >>>>> > > Hi everyone, >>>>> > > >>>>> > > I have the data with following attributes: (Latitude, Longitude). >>>>> Now I am performing clustering using DBSCAN for my data. I have >>>>> following >>>>> doubts: >>>>> > > >>>>> > > 1. How can I add data to the data set of the package? >>>>> > > 2. How I can calculate Rand index for my data? >>>>> > > 3. How to use make_blobs command for my data? >>>>> > > >>>>> > > Sample of my data is : >>>>> > > Latitude Longitude >>>>> > > 37.76901 -122.429299 >>>>> > > 37.76904 -122.42913 >>>>> > > 37.76878 -122.429092 >>>>> > > 37.7763 -122.424249 >>>>> > > 37.77627 -122.424657 >>>>> > > >>>>> > > >>>>> > > With Best Regards, >>>>> > > Shuchi Mala >>>>> > > Research Scholar >>>>> > > Department of Civil Engineering >>>>> > > MNIT Jaipur >>>>> > > >>>>> > > _______________________________________________ >>>>> > > scikit-learn mailing list >>>>> > > scikit-learn at python.org >>>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > > >>>>> > > _______________________________________________ >>>>> > > scikit-learn mailing list >>>>> > > scikit-learn at python.org >>>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > > >>>>> > > -- >>>>> > > *PhD candidate & Research Assistant* >>>>> > > *Cooperative Institute for Research in Environmental Sciences >>>>> (CIRES)* >>>>> > > *University of Colorado at Boulder* >>>>> > > >>>>> > > _______________________________________________ >>>>> > > scikit-learn mailing list >>>>> > > scikit-learn at python.org >>>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > > >>>>> > > _______________________________________________ >>>>> > > scikit-learn mailing list >>>>> > > scikit-learn at python.org >>>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > >>>>> > _______________________________________________ >>>>> > scikit-learn mailing list >>>>> > scikit-learn at python.org >>>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > >>>>> > _______________________________________________ >>>>> > scikit-learn mailing list >>>>> > scikit-learn at python.org >>>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences 
(CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From albertthomas88 at gmail.com Wed Apr 5 17:54:18 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Wed, 05 Apr 2017 21:54:18 +0000 Subject: [scikit-learn] Multiple normal scenario for OCSVM In-Reply-To: References: Message-ID: Hi Ady, Overfitting is a possible explanation. If your model learnt your normal scenarios too well then every abnormal data will be predicted as abnormal (so you will have a good performance for anomalies) however none of the normal instances of the test set will be in the normal region (so you will have a high FPR). Albert On Wed, 5 Apr 2017 at 15:37, Ady Wahyudi Paundu wrote: > Good day Scikit-Learn Masters, > > I have used Scikit-Learns OCSVM module previously with satisfying results. > However on my current tasks I have this problem for one-class analysis: > > In my previous cases, I used OCSVM for Anomaly detector, and the > normal classes in each cases were coming from one scenario. > Now, I want to create one Anomaly detector system, with multiple > normal scenario (in this case, 3 different normal scenario). Lets say > I have scenario A, B and C, and I want to distinguish all data that is > not coming from A and B and C. > What I have been tried is combining all training data A and B and C > into one data set and fit it using OCSVM module. When I tested the > output model to several anomaly data-set it worked good. However, when > I tested it against either one of the normal scenario, it gave a very > high False Positives (AUROC: 99%). > > So my question, is it because a bad approach? by combining all the > different normal data set into one training data set. > Or is it because I was using it (the OCSVM) wrong? (i use 'rbf' kernel > with nu and gamma set to 0.001) > Or is it the case with wrong tools? another algorithm perhaps? > > I dont know if this is a proper question to ask here, so if it is not > (maybe because this is just a Machine Learning question in general), > just disregard it. > > Thank you in advance > > Best regards, > Ady > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From awpaundu at gmail.com Wed Apr 5 23:06:29 2017 From: awpaundu at gmail.com (Ady Wahyudi Paundu) Date: Thu, 6 Apr 2017 12:06:29 +0900 Subject: [scikit-learn] Multiple normal scenario for OCSVM In-Reply-To: References: Message-ID: Hi Albert, Thank you for replying. You are right, a high FPR might indicate an overfitting problem. I have been having discussions with friends and our insight so far is that I was worrying a non-existent problem. Feeding two dataset of both 'Normal Classes' into the 'decision_function' of OCSVM and read its AUROC would not give any info on the quality of Anomaly Detector. A meaningful reading only if feeding it with the 'normal' class and the 'anomaly' class. Again thank you for your kind reply. Best regards, Ady On 4/6/17, Albert Thomas wrote: > Hi Ady, > > Overfitting is a possible explanation. 
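A quick way to check (a rough sketch; X_normal stands for your combined
normal training data, which only you have):

from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

X_train, X_held_out = train_test_split(X_normal, test_size=0.3)
clf = OneClassSVM(kernel='rbf', nu=0.001, gamma=0.001).fit(X_train)

# fraction of held-out *normal* points flagged as anomalies (-1)
print((clf.predict(X_held_out) == -1).mean())

If that fraction is much larger than nu, the model did not generalize
beyond the exact training points.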
If your model learnt your normal > scenarios too well then every abnormal data will be predicted as abnormal > (so you will have a good performance for anomalies) however none of the > normal instances of the test set will be in the normal region (so you will > have a high FPR). > > Albert > > On Wed, 5 Apr 2017 at 15:37, Ady Wahyudi Paundu wrote: > >> Good day Scikit-Learn Masters, >> >> I have used Scikit-Learns OCSVM module previously with satisfying >> results. >> However on my current tasks I have this problem for one-class analysis: >> >> In my previous cases, I used OCSVM for Anomaly detector, and the >> normal classes in each cases were coming from one scenario. >> Now, I want to create one Anomaly detector system, with multiple >> normal scenario (in this case, 3 different normal scenario). Lets say >> I have scenario A, B and C, and I want to distinguish all data that is >> not coming from A and B and C. >> What I have been tried is combining all training data A and B and C >> into one data set and fit it using OCSVM module. When I tested the >> output model to several anomaly data-set it worked good. However, when >> I tested it against either one of the normal scenario, it gave a very >> high False Positives (AUROC: 99%). >> >> So my question, is it because a bad approach? by combining all the >> different normal data set into one training data set. >> Or is it because I was using it (the OCSVM) wrong? (i use 'rbf' kernel >> with nu and gamma set to 0.001) >> Or is it the case with wrong tools? another algorithm perhaps? >> >> I dont know if this is a proper question to ask here, so if it is not >> (maybe because this is just a Machine Learning question in general), >> just disregard it. >> >> Thank you in advance >> >> Best regards, >> Ady >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > From s.atasever at gmail.com Thu Apr 6 08:27:00 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Thu, 6 Apr 2017 15:27:00 +0300 Subject: [scikit-learn] sklearn.cluster.Birch Message-ID: Dear scikit-learn members, I have a dat file where the columns represent the features and the rows represent the protein. (you can see the dat file in the attachment) i want to use *sklearn.cluster.Birch* with this text file values as X which is the Parameters: X : {array-like, sparse matrix} How can i transform this text file values into X : {array-like, sparse matrix}. I would appreciate if you could advise on some methods. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: features.dat Type: application/octet-stream Size: 242600 bytes Desc: not available URL: From manojkumarsivaraj334 at gmail.com Thu Apr 6 09:29:34 2017 From: manojkumarsivaraj334 at gmail.com (Manoj Kumar) Date: Thu, 6 Apr 2017 09:29:34 -0400 Subject: [scikit-learn] sklearn.cluster.Birch In-Reply-To: References: Message-ID: Hi, See: https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html for one way. On Thu, Apr 6, 2017 at 8:27 AM, Sema Atasever wrote: > Dear scikit-learn members, > > I have a dat file where the columns represent the features and the rows > represent the protein. 
(you can see the dat file in the attachment) > > > i want to use *sklearn.cluster.Birch* with this text file values as X > which is the > > Parameters: > X : {array-like, sparse matrix} > > How can i transform this text file values into X : {array-like, sparse > matrix}. > > I would appreciate if you could advise on some methods. > Thanks. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Manoj, http://github.com/MechCoder -------------- next part -------------- An HTML attachment was scrubbed... URL: From loic.esteve at ymail.com Fri Apr 7 07:26:32 2017 From: loic.esteve at ymail.com (=?UTF-8?B?TG/Dr2MgRXN0w6h2ZQ==?=) Date: Fri, 7 Apr 2017 13:26:32 +0200 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: <20161221223338.GA334300@phare.normalesup.org> Message-ID: <6135d6a4-6b8e-8af4-95fc-1a3366849c5d@ymail.com> On 12/22/2016 01:48 AM, Joel Nothman wrote: > Well, you can as a browser extension. I just haven't bothered to > investigate that technology when there's so much code to review and write. > > On 22 December 2016 at 09:33, Gael Varoquaux > > wrote: > > It's super neat. It's a pity that I don't see a way of integrating it to > the github interface. > I bit the bullet today and created a Userscript for Greasemonkey (on Chrome you can apparently use Tampermonkey) to add a button to the PR (at the moment right of the "Edit" button to edit the title) that pretty does the same thing as the bookmarklet. A snapshot is attached for clarity. This link to the raw gist should install it in your Greasemonkey user scripts: https://gist.github.com/lesteve/470170f288884ec052bcf4bc4ffe958a/raw/4270d0c731c3f1c797df3b014877b76d87b4e6bd/add_button_for_pr_circleci_doc.user.js Any comments/improvements, let me know. At the moment the button is only showed in the "Conversation" tab. Also note that for some reason that are beyond my web skills you may need to refresh the page to see the button, e.g. if you go first to https://github.com/scikit-learn/scikit-learn/pull/7995/files and then click on the "Conversation" page. It seems like the 'load' event is not triggered. Cheers, Lo?c -------------- next part -------------- A non-text attachment was scrubbed... Name: snapshot.png Type: image/png Size: 22886 bytes Desc: not available URL: From loic.esteve at ymail.com Fri Apr 7 07:28:35 2017 From: loic.esteve at ymail.com (=?UTF-8?B?TG/Dr2MgRXN0w6h2ZQ==?=) Date: Fri, 7 Apr 2017 13:28:35 +0200 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: <20161221223338.GA334300@phare.normalesup.org> Message-ID: <5569ce34-c70d-5147-d247-ab003f5ce86e@ymail.com> On 12/22/2016 01:48 AM, Joel Nothman wrote: > Well, you can as a browser extension. I just haven't bothered to > investigate that technology when there's so much code to review and write. > > On 22 December 2016 at 09:33, Gael Varoquaux > > wrote: > > It's super neat. It's a pity that I don't see a way of integrating it to > the github interface. > I bit the bullet today and created a Userscript for Greasemonkey (on Chrome you can apparently use Tampermonkey) to add a button to the PR (at the moment right of the "Edit" button to edit the title) that pretty does the same thing as the bookmarklet. A snapshot is attached for clarity. 
This link to the raw gist should install it in your Greasemonkey user scripts:
https://gist.github.com/lesteve/470170f288884ec052bcf4bc4ffe958a/raw/4270d0c731c3f1c797df3b014877b76d87b4e6bd/add_button_for_pr_circleci_doc.user.js

Any comments/improvements, let me know. At the moment the button is only showed in the "Conversation" tab. Also note that for some reason that are beyond my web skills you may need to refresh the page to see the button, e.g. if you go first to https://github.com/scikit-learn/scikit-learn/pull/7995/files and then click on the "Conversation" page. It seems like the 'load' event is not triggered.

Cheers,
Loïc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snapshot.png
Type: image/png
Size: 22886 bytes
Desc: not available
URL: 
From alessio.quaglino at usi.ch Fri Apr 7 04:06:17 2017
From: alessio.quaglino at usi.ch (Quaglino Alessio)
Date: Fri, 7 Apr 2017 08:06:17 +0000
Subject: [scikit-learn] Alpha in GaussianProcessRegressor
Message-ID: 

Hello,

I am trying to understand if alpha is truly equivalent to WhiteKernel by looking at gpr.py.

I can see that the two are the same when fit() is called, i.e. self.L_ and self.alpha_ are the same whether alpha or WhiteKernel is used.

In predict(), however, y_var = self.kernel_.diag(X) produces a different result depending on whether alpha or WhiteKernel is used. Is this correct? Indeed, if I run http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy.html the grey areas are completely different depending on which one I use, although the red and black curves are exactly the same.

Thank you in advance!

Regards,
-------------------------------------------------
Dr. Alessio Quaglino
Postdoctoral Researcher
Institute of Computational Science
Università
della Svizzera Italiana -------------- next part -------------- An HTML attachment was scrubbed... URL: From manojkumarsivaraj334 at gmail.com Sun Apr 9 16:24:56 2017 From: manojkumarsivaraj334 at gmail.com (Manoj Kumar) Date: Sun, 9 Apr 2017 16:24:56 -0400 Subject: [scikit-learn] Alpha in GaussianProcessRegressor In-Reply-To: References: Message-ID: Hi Quaglino, You are right that at predict time both are not equivalent. More specifically, in Eq 2.23 in http://www.gaussianprocess. org/gpml/chapters/RW2.pdf 1. If you use a WhiteKernel, the first term becomes K^{hat}(X*, X) + \sigma^2 where K^{hat} is the kernel that you are using apart from the WhiteKernel and \sigma^2 is the noise term learnt by the WhiteKernel. 2. If you set noise to be alpha, the first term is just K^{hat}(X*, X) Thanks! On Fri, Apr 7, 2017 at 4:06 AM, Quaglino Alessio wrote: > Hello, > > I am trying to understand if alpha is truly equivalent to WhiteKernel by > looking at gpr.py. > > I can see that that the two are the same when fit() is called, i.e. > self.L_ and self.alpha_ are the same whether alpha or WhiteKernel is used. > > In predict(), however, y_var = self.kernel_.diag(X) produces a different > result depending on whether alpha or WhiteKernel is used. Is this correct? > Indeed, if I run http://scikit-learn.org/stable/auto_examples/gaussian_ > process/plot_gpr_noisy.html the grey areas are completely different > depending on which one I use, although the red and black curves are exactly > the same. > > Thank you in advance! > > Regards, > ------------------------------------------------- > Dr. Alessio Quaglino > Postdoctoral Researcher > Institute of Computational Science > Universit? della Svizzera Italiana > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Manoj, http://github.com/MechCoder -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Mon Apr 10 07:29:35 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Mon, 10 Apr 2017 14:29:35 +0300 Subject: [scikit-learn] sklearn.cluster.Birch In-Reply-To: References: Message-ID: Dear Manoj, Thanks for your answer but when i execute the code below i get this error : "*ValueError: could not convert string to float:*" How can i solve this error. Regards. --------------------------------------------- from sklearn.cluster import Birch from io import StringIO import numpy as np X=np.loadtxt(open("C:\features.dat", "rb"), delimiter="\t") brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,compute_labels=True) brc.fit(X) Birch(branching_factor=50, compute_labels=True, copy=True, n_clusters=None,threshold=0.5) brc.predict(X) print(brc.predict(X)) On Thu, Apr 6, 2017 at 4:29 PM, Manoj Kumar wrote: > Hi, See: https://docs.scipy.org/doc/numpy/reference/generated/ > numpy.loadtxt.html for one way. > > On Thu, Apr 6, 2017 at 8:27 AM, Sema Atasever > wrote: > >> Dear scikit-learn members, >> >> I have a dat file where the columns represent the features and the rows >> represent the protein. (you can see the dat file in the attachment) >> >> >> i want to use *sklearn.cluster.Birch* with this text file values as X >> which is the >> >> Parameters: >> X : {array-like, sparse matrix} >> >> How can i transform this text file values into X : {array-like, sparse >> matrix}. >> >> I would appreciate if you could advise on some methods. >> Thanks. 
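A minimal sketch of the loading step under discussion, for reference — assuming the file is whitespace-separated rather than tab-separated, which is the fix reported further down the thread:

import numpy as np
from sklearn.cluster import Birch

# np.loadtxt splits on any run of whitespace when no delimiter is given;
# the raw string keeps backslashes in the Windows path from being
# interpreted as escape sequences.
X = np.loadtxt(r"C:\features.dat")

brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5)
print(brc.fit_predict(X))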
>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Manoj, > http://github.com/MechCoder > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sukru at sukrubezen.com Mon Apr 10 07:37:19 2017 From: sukru at sukrubezen.com (=?UTF-8?B?xZ7DvGtyw7wgQmV6ZW4=?=) Date: Mon, 10 Apr 2017 14:37:19 +0300 Subject: [scikit-learn] sklearn.cluster.Birch In-Reply-To: References: Message-ID: Hi, You may have one last separator at the end of every row in your data which makes your rows create one extra empty entity which results with that error. Or maybe your data can not be cast into float? If you provide an example row, I could help more. Best, On Mon, Apr 10, 2017 at 2:29 PM, Sema Atasever wrote: > Dear Manoj, > > Thanks for your answer but when i execute the code below i get this error > : "*ValueError: could not convert string to float:*" > > How can i solve this error. > > Regards. > > --------------------------------------------- > from sklearn.cluster import Birch > from io import StringIO > import numpy as np > > X=np.loadtxt(open("C:\features.dat", "rb"), delimiter="\t") > > > brc = Birch(branching_factor=50, n_clusters=None, > threshold=0.5,compute_labels=True) > brc.fit(X) > Birch(branching_factor=50, compute_labels=True, copy=True, > n_clusters=None,threshold=0.5) > brc.predict(X) > > print(brc.predict(X)) > > On Thu, Apr 6, 2017 at 4:29 PM, Manoj Kumar com> wrote: > >> Hi, See: https://docs.scipy.org/doc/numpy/reference/generated/numpy. >> loadtxt.html for one way. >> >> On Thu, Apr 6, 2017 at 8:27 AM, Sema Atasever >> wrote: >> >>> Dear scikit-learn members, >>> >>> I have a dat file where the columns represent the features and the rows >>> represent the protein. (you can see the dat file in the attachment) >>> >>> >>> i want to use *sklearn.cluster.Birch* with this text file values as X >>> which is the >>> >>> Parameters: >>> X : {array-like, sparse matrix} >>> >>> How can i transform this text file values into X : {array-like, sparse >>> matrix}. >>> >>> I would appreciate if you could advise on some methods. >>> Thanks. >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Manoj, >> http://github.com/MechCoder >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- -------------------------------------------------- ??kr? BEZEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Mon Apr 10 13:03:31 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Mon, 10 Apr 2017 20:03:31 +0300 Subject: [scikit-learn] sklearn.cluster.Birch In-Reply-To: References: Message-ID: Hi ??kr?, I realized I needed to use space as a delimiter, problem solved. Thanks for your help, it is really appreciated. On Mon, Apr 10, 2017 at 2:37 PM, ??kr? 
Bezen wrote: > Hi, > > You may have one last separator at the end of every row in your data which > makes your rows create one extra empty entity which results with that error. > Or maybe your data can not be cast into float? > > If you provide an example row, I could help more. > > Best, > > On Mon, Apr 10, 2017 at 2:29 PM, Sema Atasever > wrote: > >> Dear Manoj, >> >> Thanks for your answer but when i execute the code below i get this error >> : "*ValueError: could not convert string to float:*" >> >> How can i solve this error. >> >> Regards. >> >> --------------------------------------------- >> from sklearn.cluster import Birch >> from io import StringIO >> import numpy as np >> >> X=np.loadtxt(open("C:\features.dat", "rb"), delimiter="\t") >> >> >> brc = Birch(branching_factor=50, n_clusters=None, >> threshold=0.5,compute_labels=True) >> brc.fit(X) >> Birch(branching_factor=50, compute_labels=True, copy=True, >> n_clusters=None,threshold=0.5) >> brc.predict(X) >> >> print(brc.predict(X)) >> >> On Thu, Apr 6, 2017 at 4:29 PM, Manoj Kumar < >> manojkumarsivaraj334 at gmail.com> wrote: >> >>> Hi, See: https://docs.scipy.org/doc/numpy/reference/generated/numpy.l >>> oadtxt.html for one way. >>> >>> On Thu, Apr 6, 2017 at 8:27 AM, Sema Atasever >>> wrote: >>> >>>> Dear scikit-learn members, >>>> >>>> I have a dat file where the columns represent the features and the rows >>>> represent the protein. (you can see the dat file in the attachment) >>>> >>>> >>>> i want to use *sklearn.cluster.Birch* with this text file values as X >>>> which is the >>>> >>>> Parameters: >>>> X : {array-like, sparse matrix} >>>> >>>> How can i transform this text file values into X : {array-like, sparse >>>> matrix}. >>>> >>>> I would appreciate if you could advise on some methods. >>>> Thanks. >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Manoj, >>> http://github.com/MechCoder >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > -------------------------------------------------- > ??kr? BEZEN > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From surangakas at gmail.com Thu Apr 13 13:54:07 2017 From: surangakas at gmail.com (Suranga Kasthurirathne) Date: Thu, 13 Apr 2017 13:54:07 -0400 Subject: [scikit-learn] Random forest prediction probability value is limited to a single decimal point Message-ID: Hi all, I'm using scikit-learn to build a number of random forrest models using the default number of trees. However, when I print out the prediction probability ( http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba) for each outcome, its presented to me as a single decimal point (0.1, 0.2, 0.5 etc.). Only perhaps 5% of the data has more than a single decimal point. Is this normal behavior? 
is there some way I can increase the number of decimal points in the prediction probability outcomes? why arent I seeing more probabilities such as 0.231, 0.55551, 0.462156 etc.? -- Thanks and best Regards, Suranga -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Apr 13 14:41:15 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 13 Apr 2017 14:41:15 -0400 Subject: [scikit-learn] Random forest prediction probability value is limited to a single decimal point In-Reply-To: References: Message-ID: Hi, Have you tried to set numpy.set_printoptions(precision=8) ? Maybe that helps already. Best, Sebastian Sent from my iPhone > On Apr 13, 2017, at 1:54 PM, Suranga Kasthurirathne wrote: > > > Hi all, > > I'm using scikit-learn to build a number of random forrest models using the default number of trees. > > However, when I print out the prediction probability (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba) for each outcome, its presented to me as a single decimal point (0.1, 0.2, 0.5 etc.). Only perhaps 5% of the data has more than a single decimal point. > > Is this normal behavior? is there some way I can increase the number of decimal points in the prediction probability outcomes? why arent I seeing more probabilities such as 0.231, 0.55551, 0.462156 etc.? > > > -- > Thanks and best Regards, > Suranga > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Thu Apr 13 14:45:04 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 13 Apr 2017 20:45:04 +0200 Subject: [scikit-learn] Random forest prediction probability value is limited to a single decimal point In-Reply-To: References: Message-ID: <20170413184504.GB1973534@phare.normalesup.org> I would rather guess that this is related to a small n_estimators. I would try increasing n_estimators in the random forests. G On Thu, Apr 13, 2017 at 02:41:15PM -0400, Sebastian Raschka wrote: > Hi, > Have you tried to set numpy.set_printoptions(precision=8) ? Maybe that helps > already. > Best, > Sebastian > Sent from my iPhone > On Apr 13, 2017, at 1:54 PM, Suranga Kasthurirathne > wrote: > Hi all, > I'm using scikit-learn to build a number of random forrest models using the > default number of trees. > However, when I print out the prediction probability (http:// > scikit-learn.org/stable/modules/generated/ > sklearn.ensemble.RandomForestClassifier.html# > sklearn.ensemble.RandomForestClassifier.predict_proba) for each outcome, > its presented to me as a single decimal point (0.1, 0.2, 0.5 etc.). Only > perhaps 5% of the data has more than a single decimal point. > Is this normal behavior? is there some way I can increase the number of > decimal points in the prediction probability outcomes? why arent I seeing > more probabilities such as 0.231, 0.55551, 0.462156 etc.? 
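A minimal sketch of this effect on hypothetical toy data: without regularization each fully grown tree votes 0 or 1, and the forest averages the votes, so the predicted probabilities can only be multiples of 1/n_estimators — steps of 0.1 with the default of 10 trees:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
for n in (10, 1000):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    # With n=10 the class-1 probabilities land on 0.0, 0.1, ..., 1.0;
    # with n=1000 the grid of reachable values is 100 times finer.
    print(n, np.unique(rf.predict_proba(X)[:, 1])[:10])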
-- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From 26743610 at qq.com Thu Apr 13 16:58:44 2017 From: 26743610 at qq.com (=?gb18030?B?wM+zwg==?=) Date: Fri, 14 Apr 2017 04:58:44 +0800 Subject: [scikit-learn] How to dump a model to txt file? Message-ID: Hi, I am working on GradientBoostingRegressor these days and I am wondering if there is a way to dump the model into txt file, or any other format that can be processed by c++ My production system is in c++, so I want use the python-trained tree model in c++ for production. Has anyone ever done this before? thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Apr 13 17:23:12 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 13 Apr 2017 17:23:12 -0400 Subject: [scikit-learn] How to dump a model to txt file? In-Reply-To: References: Message-ID: <1791C501-DF55-4297-BE35-6BD1C147DD0E@gmail.com> Hi, not sure how this could generally work. However, you could at least dump the model parameters for e.g., linear models and compute the prediction via w_1 * x1 + w_2 * x_2 + ? + w_n * x_n + bias over the n features. To write various model attributes to text files, you could use json, e.g., see https://cmry.github.io/notes/serialize However, I don?t think that this approach will solve the problem of loading the model into C++. Best, Sebastian > On Apr 13, 2017, at 4:58 PM, ?? <26743610 at qq.com> wrote: > > Hi, > > I am working on GradientBoostingRegressor these days and I am wondering if there is a way to dump the model into txt file, or any other format that can be processed by c++ > > My production system is in c++, so I want use the python-trained tree model in c++ for production. > > Has anyone ever done this before? > > thanks > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From vaggi.federico at gmail.com Thu Apr 13 17:27:19 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Thu, 13 Apr 2017 21:27:19 +0000 Subject: [scikit-learn] How to dump a model to txt file? In-Reply-To: <1791C501-DF55-4297-BE35-6BD1C147DD0E@gmail.com> References: <1791C501-DF55-4297-BE35-6BD1C147DD0E@gmail.com> Message-ID: If you want to use the model from C++ code, the easiest way is to probably use Boost/Python ( http://www.boost.org/doc/libs/1_62_0/libs/python/doc/html/index.html). Alternatively, use another gradient boosting library that has a C++ API (like XGBoost). Keep in mind, if you want to call Python code from C++ you will have to bundle a Python interpreter as well as all the dependencies. On Thu, 13 Apr 2017 at 14:23 Sebastian Raschka wrote: > Hi, > > not sure how this could generally work. However, you could at least dump > the model parameters for e.g., linear models and compute the prediction via > > w_1 * x1 + w_2 * x_2 + ? + w_n * x_n + bias > > over the n features. > > To write various model attributes to text files, you could use json, e.g., > see https://cmry.github.io/notes/serialize > However, I don?t think that this approach will solve the problem of > loading the model into C++. > > Best, > Sebastian > > > On Apr 13, 2017, at 4:58 PM, ?? 
<26743610 at qq.com> wrote: > > > > Hi, > > > > I am working on GradientBoostingRegressor these days and I am wondering > if there is a way to dump the model into txt file, or any other format that > can be processed by c++ > > > > My production system is in c++, so I want use the python-trained tree > model in c++ for production. > > > > Has anyone ever done this before? > > > > thanks > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Fri Apr 14 03:28:43 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 14 Apr 2017 09:28:43 +0200 Subject: [scikit-learn] How to dump a model to txt file? In-Reply-To: References: <1791C501-DF55-4297-BE35-6BD1C147DD0E@gmail.com> Message-ID: <8364fe2c-e151-8c3d-a0f7-27280ba56d29@gmail.com> Also, there is an effort on converting trained scikit-learn models to other languages (e.g. C) in https://github.com/nok/sklearn-porter but it does not support GradientBoostingRegressor (yet). On 13/04/17 23:27, federico vaggi wrote: > If you want to use the model from C++ code, the easiest way is to > probably use Boost/Python > (http://www.boost.org/doc/libs/1_62_0/libs/python/doc/html/index.html). > Alternatively, use another gradient boosting library that has a C++ API > (like XGBoost). > > Keep in mind, if you want to call Python code from C++ you will have to > bundle a Python interpreter as well as all the dependencies. > > On Thu, 13 Apr 2017 at 14:23 Sebastian Raschka > wrote: > > Hi, > > not sure how this could generally work. However, you could at least > dump the model parameters for e.g., linear models and compute the > prediction via > > w_1 * x1 + w_2 * x_2 + ? + w_n * x_n + bias > > over the n features. > > To write various model attributes to text files, you could use json, > e.g., see https://cmry.github.io/notes/serialize > However, I don?t think that this approach will solve the problem of > loading the model into C++. > > Best, > Sebastian > > > On Apr 13, 2017, at 4:58 PM, ?? <26743610 at qq.com > > wrote: > > > > Hi, > > > > I am working on GradientBoostingRegressor these days and I am > wondering if there is a way to dump the model into txt file, or any > other format that can be processed by c++ > > > > My production system is in c++, so I want use the python-trained > tree model in c++ for production. > > > > Has anyone ever done this before? 
> > > > thanks > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From t3kcit at gmail.com Fri Apr 14 11:17:12 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 14 Apr 2017 11:17:12 -0400 Subject: [scikit-learn] Random forest prediction probability value is limited to a single decimal point In-Reply-To: <20170413184504.GB1973534@phare.normalesup.org> References: <20170413184504.GB1973534@phare.normalesup.org> Message-ID: On 04/13/2017 02:45 PM, Gael Varoquaux wrote: > I would rather guess that this is related to a small n_estimators. I > would try increasing n_estimators in the random forests. > Yeah the default is too small for basically all applications. And without any regularization of the forest, each leaf will have 100% probability for one of the classes. From surangakas at gmail.com Fri Apr 14 11:27:18 2017 From: surangakas at gmail.com (Suranga Kasthurirathne) Date: Fri, 14 Apr 2017 11:27:18 -0400 Subject: [scikit-learn] Random forest prediction probability value is limited to a single decimal point In-Reply-To: References: <20170413184504.GB1973534@phare.normalesup.org> Message-ID: Hi there! Thank you, yea, it was the number of estimators. I was hoping that there was something easier that I could do, but apparently not! But anyways, thank you, this did solve the problem :) On Fri, Apr 14, 2017 at 11:17 AM, Andreas Mueller wrote: > > > On 04/13/2017 02:45 PM, Gael Varoquaux wrote: > >> I would rather guess that this is related to a small n_estimators. I >> would try increasing n_estimators in the random forests. >> >> Yeah the default is too small for basically all applications. > And without any regularization of the forest, each leaf will have 100% > probability for one of the classes. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Best Regards, Suranga -------------- next part -------------- An HTML attachment was scrubbed... URL: From varikvi at yahoo.com Sun Apr 16 05:56:58 2017 From: varikvi at yahoo.com (Evaristo Caraballo) Date: Sun, 16 Apr 2017 09:56:58 +0000 (UTC) Subject: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis? References: <244414034.1197197.1492336618451.ref@mail.yahoo.com> Message-ID: <244414034.1197197.1492336618451@mail.yahoo.com> I have been asked to implement a simple knn for text similarity analysis. I tried by using sklearn.neighbors module.The file to be analysed consisted on 2 relevant columns: "text" and "name".The knn model should be fitted with bag-of-words of a corpus of around 60,000 pre-treated text fragments of about 200 words each. I used CounterVectorizer.As test I was asked to use the model to get the names in the "name" column related to 10 top text strings that are the closest to a pre-selected one that also exists in the corpus used to initialise the knn model. 
Similarity distance should be measured using a euclidean metric. I used the kneighbors function to obtain the closest neighbors. Below you can find the code I was trying to implement using kneighbors:

import os, sys
import sklearn
import sklearn.neighbors as sk_neighbors
from sklearn.feature_extraction.text import CountVectorizer
import pandas
import scipy
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

wiki = pandas.read_csv('wiki_filefragment.csv')

mod_count_vect = CountVectorizer()
count_vect = mod_count_vect.fit_transform(wiki['text'])
print(count_vect.shape)
mod_count_vect.get_feature_names()

mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
enc = mod_enc.transform(wiki['name'])
enc

model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)  # no matter what I use, it is always the same
modelfit = model.fit(count_vect, enc)
# also likely the kneighbors is not working?
print(mod_enc.inverse_transform(modelfit.kneighbors(count_vect[mod_enc.transform(['Franz Rottensteiner'])], n_neighbors=11, return_distance=False)))

This implementation gave me the following results for the first 10 nearest neighbors to 'Franz Rottensteiner':

Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel, M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch, Andrea Foulkes, Alan W. Meerow, John Warner (writer)

The results continued to be far from the test solution (which uses Graphlab Create and SFrame), which is:

Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9 Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus Campbell, Antonello Bonci, Henkjan Honing, Joseph Born Kadane

In fact, I tried a simple brute force implementation by iterating over the list of texts and calculating distances with scipy, and that gave me the expected results. The result was the same after also using Python 2.7. A link to the implementations (the one that doesn't work and the one that does), together with the file used for this test, can be found on this Gist. Can anyone suggest what is wrong with my sklearn implementation?

Relevant resources are: - Anaconda Python3.5 (with a virtenv using 2.7) - Jupyter - sklearn 0.18 - pandas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From jmannhei at rams.colostate.edu Sun Apr 16 13:15:29 2017
From: jmannhei at rams.colostate.edu (Joshua Mannheimer)
Date: Sun, 16 Apr 2017 11:15:29 -0600
Subject: [scikit-learn] Ordinary Least Square Regression Under-determined system.
Message-ID: 

Hi all,

So I am trying to write a Principal Components Regression implementation in Python to match the PLS package in R. I am getting better results in R, so I am trying to figure out where the discrepancy is. The data I am using is heavily underdetermined, with n_features ~ 50,000 and n_samples ~ 500, which is why PCR is necessary. Just to see what would happen, I used sklearn.LinearRegression on the original 500 x 50,000 dataset. I expected I would get an error message stating that the system was not solvable, but it worked and I got an answer that was at least on par with the PCR solution. So I am wondering how it is possibly solving this system, if anybody knows.

Thanks

--
Joshua D. Mannheimer
M.E. Biomedical Engineering
Ph.D. Student
Flint Animal Cancer Research Center
Office: A 259 CSU Veterinary Campus
Colorado State University
(970)-389-3951
jmannhei at rams.colostate.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From joel.nothman at gmail.com Tue Apr 18 06:15:27 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 18 Apr 2017 20:15:27 +1000 Subject: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis? In-Reply-To: <244414034.1197197.1492336618451@mail.yahoo.com> References: <244414034.1197197.1492336618451.ref@mail.yahoo.com> <244414034.1197197.1492336618451@mail.yahoo.com> Message-ID: towards debugging, perhaps add the return_distances option On 16 Apr 2017 9:19 pm, "Evaristo Caraballo via scikit-learn" < scikit-learn at python.org> wrote: > I have been asked to implement a simple knn for text similarity analysis. > I tried by using sklearn.neighbors module. > The file to be analysed consisted on 2 relevant columns: "text" and "name". > The knn model should be fitted with bag-of-words of a corpus of around > 60,000 pre-treated text fragments of about 200 words each. I used > CounterVectorizer. > As test I was asked to use the model to get the names in the "name" column > related to 10 top text strings that are the closest to a pre-selected one > that also exists in the corpus used to initialise the knn model. Similarity > distance should be measured using an euclidean metric. > I used the kneighbors function to obtain the closest neighbors. > Below you can find the code I was trying to implement using kneighbors: > > import os, sysimport sklearnimport sklearn.neighbors as sk_neighborsfrom sklearn.feature_extraction.text import CountVectorizerimport pandasimport scipyimport matplotlib.pyplot as pltimport numpy as np%matplotlib inline > > wiki = pandas.read_csv('wiki_filefragment.csv') > > mod_count_vect = CountVectorizer() > count_vect = mod_count_vect.fit_transform(wiki['text'])print(count_vect.shape) > mod_count_vect.get_feature_names() > > mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name']) > enc = mod_enc.transform(wiki['name']) > enc > > model = sk_neighbors.NearestNeighbors( n_neighbors=10, algorithm='brute', p = 2 ) #no matter what I use, it is always the same > modelfit = model.fit(count_vect, enc) > #also likely the kneighbors is not working?print( mod_enc.inverse_transform( modelfit.kneighbors( count_vect[mod_enc.transform( ['Franz Rottensteiner'] )], n_neighbors=11, return_distance=False ) ) ) > > This implementation gave me the following results for the first 10 nearest > neighbors to 'Franz Rottensteiner': > > Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel > , M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch, > Andrea Foulkes, Alan W. Meerow, John Warner (writer) > > The results continued to be far from being close to the test solution > (which use Graphlab Create and SFrame), which are: > > Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9 > Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus > Campbell, Antonello Bonci, Henkjan Honing, Joseph Born Kadane > > In fact, I tried a simple brute force implementation by iterating over the > list of texts calculating distances with scipy and that gave me the > expected results. The result was the same after also using Python 2.7. > A link to the implementations (the one that doesn't work and the one that > does) together a pick the file used for this test can be found on this > Gist . > Does anyone can suggest what it is wrong with my sklearn implementation? 
> Relevant resources are: - Anaconda Python3.5 (with a virtenv using 2.7) - > Jupyter - sklearn 0.18 - pandas > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From blrstartuphire at gmail.com Tue Apr 18 07:56:52 2017 From: blrstartuphire at gmail.com (Startup Hire) Date: Tue, 18 Apr 2017 17:26:52 +0530 Subject: [scikit-learn] XGboost Classifier error Message-ID: Hi!, I am trying to use XGBoost Classifer in RandomizedSearchCV as follows: clf = xgb.XGBClassifier() random_search_sg = RandomizedSearchCV(clf, param_distributions=params_dist, n_iter=n_iter_search, scoring=kappa_scorer, verbose=3, error_score=-1, fit_params=fit_params, n_jobs=-1) start = time() random_search_sg.fit(scaled_data, a_l) scaled_data = (0, 0) 4.53937223364 (0, 1) 4.08089927979 (0, 2) 5.08534158523 (0, 3) 0.900022077306 (0, 4) 0.582895703409 (0, 5) 3.52674131829 (0, 6) 2.00912587286 (0, 8) 1.06039501135 (0, 9) 4.8956331357 (0, 11) 1.51595206264 (0, 13) 3.00108387862 (0, 14) 0.0 (1, 0) 1.51312407788 (1, 1) 1.36029975993 (1, 2) 2.54267079261 (1, 3) 1.36638272336 (1, 4) 0.0225891281189 (1, 5) 3.52674131829 a_l = [1 0 0 ..., 0 0 0] (after using ravel) I am getting the error Python int too large to convert to C long while fitting the data using random_search_sg How to resolve this? Is this related to the formats of scaled data and a_l ? Regards, Sanant -------------- next part -------------- An HTML attachment was scrubbed... URL: From o.lyashevskaya at gmail.com Tue Apr 18 10:19:11 2017 From: o.lyashevskaya at gmail.com (Olga Lyashevska) Date: Tue, 18 Apr 2017 15:19:11 +0100 Subject: [scikit-learn] feature importance calculation in gradient boosting Message-ID: <7cc2f849-ba1e-09ef-2113-8272dc863780@gmail.com> Hi, I would like to understand how feature importances are calculated in gradient boosting regression. I know that these are the relevant functions: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165 https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056 From the literature and elsewhere I understand that Gini impurity is calculated. What is this exactly and how does it relate to 'gain' vs 'frequency' implemented in XGBoost? http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html My problem is that when I fit exactly same model in sklearn and gbm (R package) I get different variable importance plots. One of the variables which was generated randomly (keeping all other variables real) appears to be very important in sklearn and very unimportant in gbm. How is this possible that completely random variable gets the highest importance? Many thanks, Olga From olivier.grisel at ensta.org Wed Apr 19 07:15:55 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 19 Apr 2017 13:15:55 +0200 Subject: [scikit-learn] XGboost Classifier error In-Reply-To: References: Message-ID: Please provide the full traceback. Without it it's impossible to tell whether the problem is in scikit-learn or xgboost. 
Also, please provide a minimal reproduction script as explained in: http://scikit-learn.org/stable/faq.html#what-s-the-best-way-to-get-help-on-scikit-learn-usage -- Olivier From blrstartuphire at gmail.com Thu Apr 20 00:21:37 2017 From: blrstartuphire at gmail.com (Startup Hire) Date: Thu, 20 Apr 2017 09:51:37 +0530 Subject: [scikit-learn] XGboost Classifier error In-Reply-To: References: Message-ID: Hi Olivier, Thanks for your info.I will follow it from now on. Details of traceback are given below: ----------Full traceback--------------- Fitting 3 folds for each of 10 candidates, totalling 30 fits C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20. DeprecationWarning) ---------------------------------------------------------------------------OverflowError Traceback (most recent call last) in () 18 19 ---> 20 random_search_sg.fit(scaled_data, labels) 21 22 print("RandomizedSearchCV took %.2f seconds for %d candidates" C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\grid_search.py in fit(self, X, y) 1023 self.n_iter, 1024 random_state=self.random_state)-> 1025 return self._fit(X, y, sampled_params) C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\grid_search.py in _fit(self, X, y, parameter_iterable) 571 self.fit_params, return_parameters=True, 572 error_score=self.error_score)--> 573 for parameters in parameter_iterable 574 for train, test in cv) 575 C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable) 756 # was dispatched. In particular this covers the edge 757 # case of Parallel used with an exhausted iterator.--> 758 while self.dispatch_one_batch(iterator): 759 self._iterating = True 760 else: C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator) 601 602 with self._lock:--> 603 tasks = BatchedCalls(itertools.islice(iterator, batch_size)) 604 if len(tasks) == 0: 605 # No more tasks available in the iterator: tell caller to stop. 
C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __init__(self, iterator_slice) 125 126 def __init__(self, iterator_slice):--> 127 self.items = list(iterator_slice) 128 self._size = len(self.items) 129 C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\grid_search.py in (.0) 567 pre_dispatch=pre_dispatch 568 )(--> 569 delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_, 570 train, test, self.verbose, parameters, 571 self.fit_params, return_parameters=True, C:\Users\ssampathkumar\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\grid_search.py in __iter__(self) 250 + " For exhaustive searches, use GridSearchCV.") 251 for i in sample_without_replacement(grid_size, self.n_iter,--> 252 random_state=rnd): 253 yield param_grid[i] 254 sklearn\utils\_random.pyx in sklearn.utils._random.sample_without_replacement (sklearn\utils\_random.c:3975)() OverflowError: Python int too large to convert to C long -------------------End of traceback----------------------------- Shape of scaled_data and labels are: (772330, 15) and (772330,) (I tried using scaled_data as CSR matrix as well as numpy array) btw, when I run it separately (without *randomizedsearchCV*), it works fine with the same dataset: ---- ---------------------------Code below runs fine------------------------------------- params_c = { 'n_estimators': 310, 'learning_rate': 0.1, 'min_child_weight': 5, 'max_depth': 10, 'gamma': 0, 'max_delta_step': 14, 'max_depth':5, 'subsample': 1, 'colsample_bytree': 1, 'colsample_bylevel': 1, 'reg_lambda': 1, 'reg_alpha': 0, 'scale_pos_weight': 1, 'objective': 'binary:logistic', 'silent': False, } c = xgb.XGBClassifier(**params_c) X_train, X_test, y_train, y_test = train_test_split(scaled_data, labels) from sklearn.metrics import confusion_matrix c.fit(X_train,y_train) y_pred = c.predict(X_test) cm3 = confusion_matrix(y_test, y_pred) print(cm3) ---------End of code that runs fine -------------------- On Wed, Apr 19, 2017 at 4:45 PM, Olivier Grisel wrote: > Please provide the full traceback. Without it it's impossible to tell > whether the problem is in scikit-learn or xgboost. > > Also, please provide a minimal reproduction script as explained in: > > http://scikit-learn.org/stable/faq.html#what-s-the- > best-way-to-get-help-on-scikit-learn-usage > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From urvesh.patel11 at gmail.com Thu Apr 20 00:51:49 2017 From: urvesh.patel11 at gmail.com (urvesh patel) Date: Thu, 20 Apr 2017 04:51:49 +0000 Subject: [scikit-learn] feature importance calculation in gradient boosting In-Reply-To: <7cc2f849-ba1e-09ef-2113-8272dc863780@gmail.com> References: <7cc2f849-ba1e-09ef-2113-8272dc863780@gmail.com> Message-ID: I believe your random variable by chance have some predictive power. In R, use Information package and check information value of that randomly created variable. If it is > 0.05 then it has good predictive power. On Tue, Apr 18, 2017 at 7:47 AM Olga Lyashevska wrote: > Hi, > > I would like to understand how feature importances are calculated in > gradient boosting regression. 
> > I know that these are the relevant functions: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165 > > https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056 > > From the literature and elsewhere I understand that Gini impurity is > calculated. What is this exactly and how does it relate to 'gain' vs > 'frequency' implemented in XGBoost? > http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html > > My problem is that when I fit exactly same model in sklearn and gbm (R > package) I get different variable importance plots. One of the variables > which was generated randomly (keeping all other variables real) appears > to be very important in sklearn and very unimportant in gbm. How is this > possible that completely random variable gets the highest importance? > > > Many thanks, > Olga > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From o.lyashevskaya at gmail.com Thu Apr 20 05:39:10 2017 From: o.lyashevskaya at gmail.com (Olga Lyashevska) Date: Thu, 20 Apr 2017 10:39:10 +0100 Subject: [scikit-learn] feature importance calculation in gradient boosting In-Reply-To: References: <7cc2f849-ba1e-09ef-2113-8272dc863780@gmail.com> Message-ID: <9a993b4f-4870-fd8b-1f07-4252a4d917fb@gmail.com> Thank you. It seems that information value can only be calculated for a binary classification dataset, however my response variable is continuous. On 20/04/17 05:51, urvesh patel wrote: > I believe your random variable by chance have some predictive power. In > R, use Information package and check information value of that randomly > created variable. If it is > 0.05 then it has good predictive power. > On Tue, Apr 18, 2017 at 7:47 AM Olga Lyashevska > > wrote: > > Hi, > > I would like to understand how feature importances are calculated in > gradient boosting regression. > > I know that these are the relevant functions: > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165 > https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056 > > From the literature and elsewhere I understand that Gini impurity is > calculated. What is this exactly and how does it relate to 'gain' vs > 'frequency' implemented in XGBoost? > http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html > > My problem is that when I fit exactly same model in sklearn and gbm (R > package) I get different variable importance plots. One of the variables > which was generated randomly (keeping all other variables real) appears > to be very important in sklearn and very unimportant in gbm. How is this > possible that completely random variable gets the highest importance? 
> > > Many thanks, > Olga > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From alex at garel.org Thu Apr 20 05:58:07 2017 From: alex at garel.org (Alex Garel) Date: Thu, 20 Apr 2017 10:58:07 +0100 Subject: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis? In-Reply-To: <244414034.1197197.1492336618451@mail.yahoo.com> References: <244414034.1197197.1492336618451.ref@mail.yahoo.com> <244414034.1197197.1492336618451@mail.yahoo.com> Message-ID: <53423b8f-87f8-d0a0-09b1-b3fb98c143fe@garel.org> I'm not totally sure of what you're trying to do, but here are some remarks that may help you: 1. in modelfit = model.fit(count_vect, enc), the enc parameter is not used, only the count_vect matrix is used 2. when you use kneighbors you get vectors corresponding to wiki['text'] not to wiki['name'], so it seems very strange to use mod_enc.inverse_transform on it ! Maybe what you should better find those vectors in count_vect and read "name" at corresponding row in your dataframe. Hope it helps, Alex Le 16/04/2017 ? 10:56, Evaristo Caraballo via scikit-learn a ?crit : > I have been asked to implement a simple knn for text similarity > analysis. I tried by using sklearn.neighbors module. > The file to be analysed consisted on 2 relevant columns: "text" and > "name". > The knn model should be fitted with bag-of-words of a corpus of around > 60,000 pre-treated text fragments of about 200 words each. I used > CounterVectorizer. > As test I was asked to use the model to get the names in the "name" > column related to 10 top text strings that are the closest to a > pre-selected one that also exists in the corpus used to initialise the > knn model. Similarity distance should be measured using an euclidean > metric. > I used the kneighbors function to obtain the closest neighbors. > Below you can find the code I was trying to implement using kneighbors: > |importos,sys importsklearn importsklearn.neighbors assk_neighbors > fromsklearn.feature_extraction.text importCountVectorizerimportpandas > importscipy importmatplotlib.pyplot asplt importnumpy asnp %matplotlib > inline wiki =pandas.read_csv('wiki_filefragment.csv')mod_count_vect > =CountVectorizer()count_vect > =mod_count_vect.fit_transform(wiki['text'])print(count_vect.shape)mod_count_vect.get_feature_names()mod_enc > =sklearn.preprocessing.LabelEncoder().fit(wiki['name'])enc > =mod_enc.transform(wiki['name'])enc model > =sk_neighbors.NearestNeighbors(n_neighbors=10,algorithm='brute',p > =2)#no matter what I use, it is always the samemodelfit > =model.fit(count_vect,enc)#also likely the kneighbors is not > working?print(mod_enc.inverse_transform(modelfit.kneighbors(count_vect[mod_enc.transform(['Franz > Rottensteiner'])],n_neighbors=11,return_distance=False)))| > This implementation gave me the following results for the first 10 > nearest neighbors to 'Franz Rottensteiner': > > Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, > Tofusquirrel , M. G. Sheftall, Peter Maurer, Allan Weisbecker, > Ferdinand Knobloch, Andrea Foulkes, Alan W. 
Meerow, John Warner > (writer) > > The results continued to be far from being close to the test solution > (which use Graphlab Create and SFrame), which are: > > Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, > Andr%C3%A9 Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. > Meerow, John Angus Campbell, Antonello Bonci, Henkjan Honing, > Joseph Born Kadane > > In fact, I tried a simple brute force implementation by iterating over > the list of texts calculating distances with scipy and that gave me > the expected results. The result was the same after also using Python 2.7. > A link to the implementations (the one that doesn't work and the one > that does) together a pick the file used for this test can be found on > this Gist > . > Does anyone can suggest what it is wrong with my sklearn implementation? > Relevant resources are: - Anaconda Python3.5 (with a virtenv using > 2.7) - Jupyter - sklearn 0.18 - pandas > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: OpenPGP digital signature URL: From joel.nothman at gmail.com Thu Apr 20 06:46:49 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 20 Apr 2017 20:46:49 +1000 Subject: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis? In-Reply-To: <53423b8f-87f8-d0a0-09b1-b3fb98c143fe@garel.org> References: <244414034.1197197.1492336618451.ref@mail.yahoo.com> <244414034.1197197.1492336618451@mail.yahoo.com> <53423b8f-87f8-d0a0-09b1-b3fb98c143fe@garel.org> Message-ID: The problem is the misuse of the label encoder. See https://github.com/scikit-learn/scikit-learn/issues/8767 On 20 April 2017 at 19:58, Alex Garel wrote: > I'm not totally sure of what you're trying to do, but here are some > remarks that may help you: > > 1. in modelfit = model.fit(count_vect, enc), the enc parameter is not > used, only the count_vect matrix is used > 2. when you use kneighbors you get vectors corresponding to wiki['text'] > not to wiki['name'], so it seems very strange to use > mod_enc.inverse_transform on it ! > > Maybe what you should better find those vectors in count_vect and read > "name" at corresponding row in your dataframe. > > Hope it helps, > > Alex > > > Le 16/04/2017 ? 10:56, Evaristo Caraballo via scikit-learn a ?crit : > > I have been asked to implement a simple knn for text similarity analysis. > I tried by using sklearn.neighbors module. > The file to be analysed consisted on 2 relevant columns: "text" and "name". > The knn model should be fitted with bag-of-words of a corpus of around > 60,000 pre-treated text fragments of about 200 words each. I used > CounterVectorizer. > As test I was asked to use the model to get the names in the "name" column > related to 10 top text strings that are the closest to a pre-selected one > that also exists in the corpus used to initialise the knn model. Similarity > distance should be measured using an euclidean metric. > I used the kneighbors function to obtain the closest neighbors. 
On 20 April 2017 at 19:58, Alex Garel wrote:
> I'm not totally sure of what you're trying to do, but here are some
> remarks that may help you:
>
> 1. in modelfit = model.fit(count_vect, enc), the enc parameter is not
> used; only the count_vect matrix is used.
>
> 2. when you use kneighbors you get vectors corresponding to wiki['text'],
> not to wiki['name'], so it seems very strange to use
> mod_enc.inverse_transform on it!
>
> Maybe you should instead find those vectors in count_vect and read "name"
> at the corresponding row of your dataframe.
>
> Hope it helps,
>
> Alex
>
> On 16/04/2017 at 10:56, Evaristo Caraballo via scikit-learn wrote:
>
>> I have been asked to implement a simple knn for text similarity analysis.
>> I tried using the sklearn.neighbors module.
>> The file to be analysed consists of 2 relevant columns: "text" and "name".
>> The knn model should be fitted with bag-of-words vectors of a corpus of
>> around 60,000 pre-treated text fragments of about 200 words each. I used
>> CountVectorizer.
>> As a test, I was asked to use the model to get the names in the "name"
>> column for the 10 texts closest to a pre-selected one that also exists in
>> the corpus used to initialise the knn model. Similarity should be measured
>> using a euclidean metric.
>> I used the kneighbors function to obtain the closest neighbors.
>> Below you can find the code I was trying to implement using kneighbors:
>>
>> import os, sys
>> import sklearn
>> import sklearn.preprocessing
>> import sklearn.neighbors as sk_neighbors
>> from sklearn.feature_extraction.text import CountVectorizer
>> import pandas
>> import scipy
>> import matplotlib.pyplot as plt
>> import numpy as np
>> %matplotlib inline
>>
>> wiki = pandas.read_csv('wiki_filefragment.csv')
>> mod_count_vect = CountVectorizer()
>> count_vect = mod_count_vect.fit_transform(wiki['text'])
>> print(count_vect.shape)
>> mod_count_vect.get_feature_names()
>>
>> mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
>> enc = mod_enc.transform(wiki['name'])
>> enc
>>
>> model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)  # no matter what I use, it is always the same
>> modelfit = model.fit(count_vect, enc)
>> # also likely the kneighbors is not working?
>> print(mod_enc.inverse_transform(modelfit.kneighbors(count_vect[mod_enc.transform(['Franz Rottensteiner'])], n_neighbors=11, return_distance=False)))
>>
>> This implementation gave me the following results for the first 10 nearest
>> neighbors to 'Franz Rottensteiner':
>>
>> Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel,
>> M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch,
>> Andrea Foulkes, Alan W. Meerow, John Warner (writer)
>>
>> The results continued to be far from the test solution (which uses
>> GraphLab Create and SFrame):
>>
>> Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9
>> Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus
>> Campbell, Antonello Bonci, Henkjan Honing, Joseph Born Kadane
>>
>> In fact, I tried a simple brute-force implementation, iterating over the
>> list of texts and calculating distances with scipy, and that gave me the
>> expected results. The result was the same after also using Python 2.7.
>> A link to the implementations (the one that doesn't work and the one that
>> does), together with the file used for this test, can be found on this
>> Gist.
>> Can anyone suggest what is wrong with my sklearn implementation?
>> Relevant resources are: - Anaconda Python 3.5 (with a virtualenv using
>> 2.7) - Jupyter - sklearn 0.18 - pandas
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From varikvi at yahoo.com Fri Apr 21 05:59:03 2017
From: varikvi at yahoo.com (Evaristo Caraballo)
Date: Fri, 21 Apr 2017 09:59:03 +0000 (UTC)
Subject: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis?
In-Reply-To: References: <244414034.1197197.1492336618451.ref@mail.yahoo.com> <244414034.1197197.1492336618451@mail.yahoo.com> <53423b8f-87f8-d0a0-09b1-b3fb98c143fe@garel.org>
Message-ID: <554141200.5727854.1492768743713@mail.yahoo.com>

Indeed, Joel: you are totally right. I absolutely misinterpreted the use of the encoder. Thanks Joel and Alex for having a look!

On Thursday, 20 April 2017 at 12:46:52, Joel Nothman wrote:

The problem is the misuse of the label encoder.
See https://github.com/scikit-learn/scikit-learn/issues/8767

On 20 April 2017 at 19:58, Alex Garel wrote:

I'm not totally sure of what you're trying to do, but here are some remarks that may help you:

1. in modelfit = model.fit(count_vect, enc), the enc parameter is not used; only the count_vect matrix is used.

2. when you use kneighbors you get vectors corresponding to wiki['text'], not to wiki['name'], so it seems very strange to use mod_enc.inverse_transform on it!

Maybe you should instead find those vectors in count_vect and read "name" at the corresponding row of your dataframe.

Hope it helps,

Alex
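The mix-up can be seen in isolation, with made-up names: LabelEncoder codes are ranks of the sorted labels, not dataframe row positions, so feeding kneighbors indices through inverse_transform permutes the names arbitrarily:

from sklearn.preprocessing import LabelEncoder

names = ['Zoe', 'Abe', 'Mia']
enc = LabelEncoder().fit(names)

print(enc.transform(names))              # [2 0 1]: alphabetical codes, not row order
print(enc.inverse_transform([0, 1, 2]))  # ['Abe' 'Mia' 'Zoe']
# so treating kneighbors row indices as encoded labels returns unrelated names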
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From surangakas at gmail.com Sun Apr 23 11:50:49 2017
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Sun, 23 Apr 2017 11:50:49 -0400
Subject: [scikit-learn] What if I don't want performance measures per each outcome class?
Message-ID: 

Hello all,

I'm looking at the confusion matrix and performance measures (precision, recall, f-measure etc.) produced by scikit.

It seems that scikit calculates these measures per outcome class, and then combines them into some sort of average.

I would really like to see these measures presented in the traditional(?) context, where sensitivity is TP / (TP + FN), combined, and NOT per class!

If I were to take scikit predictions and calculate sensitivity using the above, my results won't match up to what scikit says :(

How can I switch to seeing overall performance measures, and not per class? And also, how may I obtain 95% confidence intervals for each of these measures?

--
Best Regards,
Suranga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com Mon Apr 24 06:55:25 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 24 Apr 2017 20:55:25 +1000
Subject: [scikit-learn] What if I don't want performance measures per each outcome class?
In-Reply-To: References: Message-ID: 

"Traditional" sensitivity is defined for binary classification only. Maybe micro-average is what you're looking for, but in the multiclass case, without anything more specified, you'll merely be calculating accuracy.

Perhaps quantiles of the scores returned by permutation_test_score will give you the CIs you seek.
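As a small illustration with made-up labels (not the poster's data), per-class and micro-averaged scores can be compared directly:

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

# per-class scores, as in classification_report
print(precision_recall_fscore_support(y_true, y_pred, average=None))

# micro-averaging pools TP/FP/FN over all classes, so recall becomes
# sum(TP) / (sum(TP) + sum(FN)); for single-label multiclass input
# this collapses to plain accuracy
print(precision_recall_fscore_support(y_true, y_pred, average='micro'))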
On 24 April 2017 at 01:50, Suranga Kasthurirathne wrote:
> Hello all,
>
> I'm looking at the confusion matrix and performance measures (precision,
> recall, f-measure etc.) produced by scikit.
>
> It seems that scikit calculates these measures per outcome class, and
> then combines them into some sort of average.
>
> I would really like to see these measures presented in the traditional(?)
> context, where sensitivity is TP / (TP + FN), combined, and NOT per class!
>
> If I were to take scikit predictions and calculate sensitivity using the
> above, my results won't match up to what scikit says :(
>
> How can I switch to seeing overall performance measures, and not per
> class? And also, how may I obtain 95% confidence intervals for each of
> these measures?
>
> --
> Best Regards,
> Suranga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com Mon Apr 24 08:25:22 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Mon, 24 Apr 2017 14:25:22 +0200
Subject: [scikit-learn] Scikit-learn sprint @Paris 6-10 June
Message-ID: <87y3uq3sct.fsf@lemaitre-HP-EliteBook-840-G3>

Dear all,

As previously mentioned, the scikit-learn community will gather for a coding sprint in Paris from the 6th to the 10th of June, ahead of the PyParis conference. All information regarding this event is available on the following wiki page:

https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events

Do not hesitate to consult this page regularly; any new information will be published there. This event is also open to new contributors, who can refer to the above link. Participants should add themselves by editing the wiki page.

Hope to see you soon in Paris.

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

From t3kcit at gmail.com Tue Apr 25 12:40:24 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 25 Apr 2017 12:40:24 -0400
Subject: [scikit-learn] Ordinary Least Square Regression Under-determined system.
In-Reply-To: References: Message-ID: 

PLS is not the same as PCR, right? Why did you expect them to perform the same?
LinearRegression is just calling scipy.linalg.lstsq
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html
as you can see here:
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/linear_model/base.py#L539
I expect that produces the minimum-norm solution, though I'm not familiar with the exact solver.

On 04/16/2017 01:15 PM, Joshua Mannheimer wrote:
> Hi all,
>
> So I am trying to write a Principal Components Regression
> implementation in Python to match the PLS package in R. I am getting
> better results in R, so I am trying to figure out where the discrepancy
> is. The data I am using is badly underdetermined, with n_features ~ 50,000
> and n_samples ~ 500, which is why PCR is necessary. Just to see what would
> happen, I used sklearn.LinearRegression on the original 500 x 50,000
> dataset. I expected I would get an error message stating that the
> system was not solvable, but it worked, and I got an answer that was at
> least on par with the PCR solution. So I am wondering how it is possibly
> solving this system, if anybody knows. Thanks
>
> --
> Joshua D. Mannheimer M.E.
> Biomedical Engineering Ph.D. student
> Flint Animal Cancer Research Center
> Office: A 259 CSU Veterinary Campus
> Colorado State University
> (970)-389-3951
> jmannhei at rams.colostate.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From josef.pktd at gmail.com Tue Apr 25 13:14:06 2017
From: josef.pktd at gmail.com (josef.pktd at gmail.com)
Date: Tue, 25 Apr 2017 13:14:06 -0400
Subject: [scikit-learn] Ordinary Least Square Regression Under-determined system.
In-Reply-To: References: Message-ID: 

scipy.linalg.lstsq uses an SVD solver and drops singular components, where "singular" depends on the condition-number threshold. So it's equivalent to PCR with a tiny threshold for dropping components (rcond < 1e-15, if it's similar to numpy). SVD/rcond operates on the original, not on standardized, variables.
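A tiny demonstration of that minimum-norm behaviour, on made-up random data rather than the dataset from the thread:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(5, 50)   # far more features than samples
y = rng.randn(5)

coef = LinearRegression(fit_intercept=False).fit(X, y).coef_

# the underdetermined system is fit exactly, by the exact solution
# with the smallest L2 norm
print(np.allclose(X.dot(coef), y))   # True
print(np.linalg.norm(coef))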
The PCA in PCR is not supervised, in contrast to PLS (at least in general; I have no idea about the scikit-learn versions).

Josef

On Tue, Apr 25, 2017 at 12:40 PM, Andreas Mueller wrote:
> PLS is not the same as PCR, right? Why did you expect them to perform the
> same?
> LinearRegression is just calling scipy.linalg.lstsq
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html
> as you can see here:
> https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/linear_model/base.py#L539
> I expect that produces the minimum-norm solution, though I'm not familiar
> with the exact solver.
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From isaac.laughlin at gmail.com Wed Apr 26 18:54:26 2017
From: isaac.laughlin at gmail.com (Isaac Laughlin)
Date: Wed, 26 Apr 2017 22:54:26 +0000
Subject: [scikit-learn] Plotting Code in Examples Pull Request
Message-ID: 

I'm hoping to get some more comments on this pull request:

https://github.com/scikit-learn/scikit-learn/pull/8490

I've got some availability and would love to get this resolved and start working on some others if possible.

Thanks in advance!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From a_lago at hotmail.com Thu Apr 27 09:44:21 2017
From: a_lago at hotmail.com (andres lago)
Date: Thu, 27 Apr 2017 13:44:21 +0000
Subject: [scikit-learn] Contribution to sklearn: Cross validation of time series
In-Reply-To: References: Message-ID: 

Hello,

I'd like to contribute a new piece of functionality to sklearn: cross-validation of time series. It's an evolution of the current functionality implemented by TimeSeriesSplit.

TimeSeriesSplit only allows the user to set the number of folds. In real life, when cross-validating time series, other parameters are required, for instance:

- minimum size of the CV-training set
- size of the CV-test set
- fixed or variable length of the CV-training set.

The functionality is inspired by the R library 'caret'.

If you agree, I can share my code. I developed it for a project with the French rail company SNCF. It's in production now.

Regards,
Andres

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com Fri Apr 28 11:48:26 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 28 Apr 2017 11:48:26 -0400
Subject: [scikit-learn] Contribution to sklearn: Cross validation of time series
In-Reply-To: References: Message-ID: 

Hey Andres.
I think there might be a PR for that.
Can you explain the minimum size of the training set? How is that used?
I thought the other main option would be "rolling window" cross-validation, which uses a fixed-length CV training set.

So the two options to me were rolling window and what we're doing right now.
Can you elaborate on the other use cases, like the minimum size of the training set, and why you would want the other options with a variable-length training set?

Thanks,
Andy

On 04/27/2017 09:44 AM, andres lago wrote:
> Hello,
> I'd like to contribute a new piece of functionality to sklearn:
> cross-validation of time series. It's an evolution of the current
> functionality implemented by TimeSeriesSplit.
>
> TimeSeriesSplit only allows the user to set the number of folds. In real
> life, when cross-validating time series, other parameters are required,
> for instance:
> - minimum size of the CV-training set
> - size of the CV-test set
> - fixed or variable length of the CV-training set.
>
> The functionality is inspired by the R library 'caret'.
>
> If you agree, I can share my code. I developed it for a project with the
> French rail company SNCF. It's in production now.
>
> Regards,
> Andres
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sylvain.marchienne at gmail.com Fri Apr 28 12:13:05 2017
From: sylvain.marchienne at gmail.com (Sylvain Marchienne)
Date: Fri, 28 Apr 2017 18:13:05 +0200
Subject: [scikit-learn] Contribution to sklearn: Cross validation of time series
In-Reply-To: References: Message-ID: <39B7D64D-6FF7-4D65-88EC-22FFF4E3F2AD@gmail.com>

Hi Andres, hi Andy,

Indeed, in real life I have also needed to cross-validate time series in a different manner than the TimeSeriesSplit implemented in sklearn does. I fully support the idea of such a contribution, Andres.

As Andy mentioned, the main option would be a "rolling window" or, as I usually say, a "sliding window" technique. I think this is what you meant. So that we understand each other, here is a short explanation: think of your data sorted chronologically along a time axis. Set a constant test-set length (interval), which "slides" over time. The training set is then simply the rest of the data before the first point of the test set. I have attached a slide I used during a presentation of this principle.

Andy, this probably wasn't your exact idea, but I think it is close.

Thanks,
Sylvain

> On 28 Apr 2017 at 17:48, Andreas Mueller wrote:
>
> Hey Andres.
> I think there might be a PR for that.
> Can you explain the minimum size of the training set? How is that used?
> I thought the other main option would be "rolling window" cross-validation,
> which uses a fixed-length CV training set.
>
> So the two options to me were rolling window and what we're doing right now.
> Can you elaborate on the other use cases, like the minimum size of the
> training set, and why you would want the other options with a variable-length
> training set?
> Thanks,
> Andy
>
> On 04/27/2017 09:44 AM, andres lago wrote:
>> Hello,
>> I'd like to contribute a new piece of functionality to sklearn:
>> cross-validation of time series. It's an evolution of the current
>> functionality implemented by TimeSeriesSplit.
>>
>> TimeSeriesSplit only allows the user to set the number of folds. In real
>> life, when cross-validating time series, other parameters are required,
>> for instance:
>> - minimum size of the CV-training set
>> - size of the CV-test set
>> - fixed or variable length of the CV-training set.
>>
>> The functionality is inspired by the R library 'caret'.
>>
>> If you agree, I can share my code. I developed it for a project with the
>> French rail company SNCF. It's in production now.
>>
>> Regards,
>> Andres
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sliding_window.jpg
Type: image/jpeg
Size: 32063 bytes
Desc: not available
URL: 

From a_lago at hotmail.com Fri Apr 28 12:26:56 2017
From: a_lago at hotmail.com (andres lago)
Date: Fri, 28 Apr 2017 16:26:56 +0000
Subject: [scikit-learn] Contribution to sklearn: Cross validation of time series
In-Reply-To: References: Message-ID: 

Hi Andy,
I'll try to be more precise about the CV I'm proposing. Compared to the current TimeSeriesSplit, these would be the new parameters:

- Rolling window or variable-length window: rolling-window mode keeps the CV-training set at the same size for all folds, shifting it forward at each CV iteration. Variable-length mode grows the CV-training set at each fold iteration (the current implementation in TimeSeriesSplit).
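For reference, the current variable-length behaviour can be seen directly on a tiny made-up run:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
for train, test in TimeSeriesSplit(n_splits=3).split(X):
    print(train, test)   # the training window grows at every fold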
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From a_lago at hotmail.com Fri Apr 28 13:31:56 2017
From: a_lago at hotmail.com (andres lago)
Date: Fri, 28 Apr 2017 17:31:56 +0000
Subject: [scikit-learn] Contribution to sklearn: Cross validation of time series
In-Reply-To: References: Message-ID: 

Hi Andy,
sorry, I pushed an unwanted 'send' in the previous message.

Thanks for your quick reply. I'll try to be more precise about the CV I'm proposing. Compared to the current implementation (TimeSeriesSplit), these would be the new parameters:

1 - CV mode: rolling window or variable-length window:
> Rolling window: keeps the same CV-training set size for all folds, shifting it forward at each CV iteration.
> Variable-length window: grows the CV-training set at each fold iteration (the current implementation in TimeSeriesSplit).

2 - Minimum size of the CV-training set: the initial size of the CV-training set. It is the minimum number of observations required before making the first predictions.

3 - Size of the CV-test set: constant for all folds; it should match the prediction horizon.

The number of folds is not required anymore; it is calculated automatically from fields 2 & 3.

The idea behind this contribution is to cover some common use cases around CV that are impossible with TimeSeriesSplit today:
- Your data shows no seasonality and your dataset is huge, so you'd like to perform CV with a rolling window to speed things up.
- The client asked for a prediction horizon of 7 days, so you'd like the CV tests to use this horizon.
- The data has a strong seasonality, so you want to fit at least one month of observations before the first prediction in CV.

Please find enclosed some graphics to ease understanding of the proposal; a rough sketch of the mechanics follows below.
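A rough standalone sketch of those semantics (illustrative only: this is not the production code mentioned above, and the helper name and its behaviour are assumptions; note how the number of folds falls out of fields 2 & 3):

import numpy as np

def time_series_folds(n_samples, min_train_size, test_size, rolling=True):
    # Yield (train_idx, test_idx) pairs over time-ordered samples.
    # rolling=True keeps the training window at min_train_size and
    # slides it forward; rolling=False grows it, like TimeSeriesSplit.
    indices = np.arange(n_samples)
    test_start = min_train_size
    while test_start + test_size <= n_samples:
        if rolling:
            train = indices[test_start - min_train_size:test_start]
        else:
            train = indices[:test_start]
        yield train, indices[test_start:test_start + test_size]
        test_start += test_size

# e.g. 10 observations, at least 4 before the first prediction, horizon of 2
for train, test in time_series_folds(10, min_train_size=4, test_size=2):
    print(train, test)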
Regards,
Andrés

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pastedImage.png
Type: image/png
Size: 104405 bytes
Desc: pastedImage.png
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pastedImage.png
Type: image/png
Size: 48958 bytes
Desc: pastedImage.png
URL: 

From noflaco at gmail.com Sat Apr 29 22:59:23 2017
From: noflaco at gmail.com (Carlton Banks)
Date: Sun, 30 Apr 2017 04:59:23 +0200
Subject: [scikit-learn] gridsearchCV able to handle list of input?
Message-ID: 

I am currently trying to run some gridsearchCV on a keras model which has multiple inputs. The inputs are stored in a list in which each entry is the input for a specific channel.

Here is my model and how I use the gridsearch:

https://pastebin.com/GMKH1L80

The error I am getting is:

https://pastebin.com/A3cB0rMv

Any idea how I can resolve this?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com Sun Apr 30 06:02:50 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 30 Apr 2017 20:02:50 +1000
Subject: [scikit-learn] gridsearchCV able to handle list of input?
In-Reply-To: References: Message-ID: 

What are the shapes of train_input and train_output?

On 30 April 2017 at 12:59, Carlton Banks wrote:
> I am currently trying to run some gridsearchCV on a keras model which has
> multiple inputs. The inputs are stored in a list in which each entry is
> the input for a specific channel.
>
> Here is my model and how I use the gridsearch:
>
> https://pastebin.com/GMKH1L80
>
> The error I am getting is:
>
> https://pastebin.com/A3cB0rMv
>
> Any idea how I can resolve this?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From julio at esbet.es Sun Apr 30 06:11:21 2017
From: julio at esbet.es (Julio Antonio Soto de Vicente)
Date: Sun, 30 Apr 2017 12:11:21 +0200
Subject: [scikit-learn] gridsearchCV able to handle list of input?
In-Reply-To: References: Message-ID: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es>

Tbh I've never tried, but I would say that the current sklearn API does not support multi-input data...

> On 30 Apr 2017, at 12:02, Joel Nothman wrote:
>
> What are the shapes of train_input and train_output?
>
>> On 30 April 2017 at 12:59, Carlton Banks wrote:
>> I am currently trying to run some gridsearchCV on a keras model which has
>> multiple inputs. The inputs are stored in a list in which each entry is
>> the input for a specific channel.
>>
>> Here is my model and how I use the gridsearch:
>>
>> https://pastebin.com/GMKH1L80
>>
>> The error I am getting is:
>>
>> https://pastebin.com/A3cB0rMv
>>
>> Any idea how i can resolve this?
>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Apr 30 06:57:42 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 30 Apr 2017 20:57:42 +1000 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> Message-ID: Scikit-learn should accept a list as X to grid search and index it just fine. So I'm not sure that constraint applies to Grid Search On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente wrote: > Tbh I've never tried, but I would say that te current sklearn API does not > support multi-input data... > > El 30 abr 2017, a las 12:02, Joel Nothman > escribi?: > > What are the shapes of train_input and train_output? > > On 30 April 2017 at 12:59, Carlton Banks wrote: > >> I am currently trying to run some gridsearchCV on a keras model which has >> multiple inputs. >> The inputs is stored in a list in which each entry in the list is a input >> for a specific channel. >> >> >> Here is my model and how i use the gridsearch. >> >> https://pastebin.com/GMKH1L80 >> >> The error i am getting is: >> >> https://pastebin.com/A3cB0rMv >> >> Any idea how i can resolve this? >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sun Apr 30 08:18:10 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sun, 30 Apr 2017 14:18:10 +0200 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> Message-ID: <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> The shapes are print len(train_input) print train_input[0].shape print train_output.shape 33 (100, 8, 45, 3) (100, 1, 145) 100 is the batch-size.. > Den 30. apr. 2017 kl. 12.57 skrev Joel Nothman : > > Scikit-learn should accept a list as X to grid search and index it just fine. So I'm not sure that constraint applies to Grid Search > > On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente > wrote: > Tbh I've never tried, but I would say that te current sklearn API does not support multi-input data... > > El 30 abr 2017, a las 12:02, Joel Nothman > escribi?: > >> What are the shapes of train_input and train_output? >> >> On 30 April 2017 at 12:59, Carlton Banks > wrote: >> I am currently trying to run some gridsearchCV on a keras model which has multiple inputs. >> The inputs is stored in a list in which each entry in the list is a input for a specific channel. >> >> >> Here is my model and how i use the gridsearch. 
>> >> https://pastebin.com/GMKH1L80 >> >> The error i am getting is: >> >> https://pastebin.com/A3cB0rMv >> >> Any idea how i can resolve this? >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sun Apr 30 09:23:53 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sun, 30 Apr 2017 15:23:53 +0200 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> Message-ID: <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> It seems like scikit-learn is not able to handle network with multiple inputs. Keras documentation states: You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers found at keras.wrappers.scikit_learn.py. But besides what the wrapper can do.. can scikit-learn really not handle multiple inputs?.. > Den 30. apr. 2017 kl. 14.18 skrev Carlton Banks : > > The shapes are > > print len(train_input) > print train_input[0].shape > print train_output.shape > > 33 > (100, 8, 45, 3) > (100, 1, 145) > > 100 is the batch-size.. >> Den 30. apr. 2017 kl. 12.57 skrev Joel Nothman >: >> >> Scikit-learn should accept a list as X to grid search and index it just fine. So I'm not sure that constraint applies to Grid Search >> >> On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente > wrote: >> Tbh I've never tried, but I would say that te current sklearn API does not support multi-input data... >> >> El 30 abr 2017, a las 12:02, Joel Nothman > escribi?: >> >>> What are the shapes of train_input and train_output? >>> >>> On 30 April 2017 at 12:59, Carlton Banks > wrote: >>> I am currently trying to run some gridsearchCV on a keras model which has multiple inputs. >>> The inputs is stored in a list in which each entry in the list is a input for a specific channel. >>> >>> >>> Here is my model and how i use the gridsearch. >>> >>> https://pastebin.com/GMKH1L80 >>> >>> The error i am getting is: >>> >>> https://pastebin.com/A3cB0rMv >>> >>> Any idea how i can resolve this? 
>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From george at georgefisher.com Sun Apr 30 15:13:52 2017 From: george at georgefisher.com (George Fisher) Date: Sun, 30 Apr 2017 15:13:52 -0400 Subject: [scikit-learn] RFE/RFECV parameter suggestion Message-ID: I found that xgboost generates an exception under RFECV when the number of features remaining falls below 3. I fixed this for myself by adding a 'stop_at' parameter (default=1) that stops the process in RFE when the remaining features falls below this number. I think it might be a useful feature more broadly than simply as a hacked work-around so I offer it as a pull request. George Fisher george at georgefisher.com +1 917-514-8204 https://github.com/grfiv Ubuntu 17.04 Desktop Python 3.5.3 IPython 6.0.0 sklearn 0.18.1 (xgboost 0.6) -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Apr 30 17:50:06 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 30 Apr 2017 17:50:06 -0400 Subject: [scikit-learn] RFE/RFECV parameter suggestion In-Reply-To: References: Message-ID: For RFECV, I think that a min_features parameter could be useful. Alternatively, making XGBoost more scikit-learn compatible instead of making scikit-learn more XGBoost compatible could be another take on this. Best, Sebastian > On Apr 30, 2017, at 3:13 PM, George Fisher wrote: > > I found that xgboost generates an exception under RFECV when the number of features remaining falls below 3. I fixed this for myself by adding a 'stop_at' parameter (default=1) that stops the process in RFE when the remaining features falls below this number. I think it might be a useful feature more broadly than simply as a hacked work-around so I offer it as a pull request. > > George Fisher > george at georgefisher.com > +1 917-514-8204 > https://github.com/grfiv > > Ubuntu 17.04 Desktop > Python 3.5.3 > IPython 6.0.0 > sklearn 0.18.1 > (xgboost 0.6) > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From manojkumarsivaraj334 at gmail.com Sun Apr 30 17:53:47 2017 From: manojkumarsivaraj334 at gmail.com (Manoj Kumar) Date: Sun, 30 Apr 2017 17:53:47 -0400 Subject: [scikit-learn] RFE/RFECV parameter suggestion In-Reply-To: References: Message-ID: See https://github.com/scikit-learn/scikit-learn/issues/6564 and https://github.com/scikit-learn/scikit-learn/pull/7269 On Sun, Apr 30, 2017 at 5:50 PM, Sebastian Raschka wrote: > For RFECV, I think that a min_features parameter could be useful. > > Alternatively, making XGBoost more scikit-learn compatible instead of > making scikit-learn more XGBoost compatible could be another take on this. 
> > Best, > Sebastian > > > On Apr 30, 2017, at 3:13 PM, George Fisher > wrote: > > > > I found that xgboost generates an exception under RFECV when the number > of features remaining falls below 3. I fixed this for myself by adding a > 'stop_at' parameter (default=1) that stops the process in RFE when the > remaining features falls below this number. I think it might be a useful > feature more broadly than simply as a hacked work-around so I offer it as a > pull request. > > > > George Fisher > > george at georgefisher.com > > +1 917-514-8204 > > https://github.com/grfiv > > > > Ubuntu 17.04 Desktop > > Python 3.5.3 > > IPython 6.0.0 > > sklearn 0.18.1 > > (xgboost 0.6) > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Manoj, http://github.com/MechCoder -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Apr 30 20:17:25 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 1 May 2017 10:17:25 +1000 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> Message-ID: Sorry, I don't know enough about keras and its terminology. Scikit-learn usually limits itself to datasets where features and targets are a rectangular matrix. But grid search and other model selection tools should allow data of other shapes as long as they can be indexed on the first axis. You may be best off, however, getting support from the Keras folks. On 30 April 2017 at 23:23, Carlton Banks wrote: > It seems like scikit-learn is not able to handle network with multiple > inputs. > Keras documentation states: > > You can use Sequential Keras models (*single-input only*) as part of your > Scikit-Learn workflow via the wrappers found at keras.wrappers.scikit_ > learn.py. > But besides what the wrapper can do.. can scikit-learn really not handle > multiple inputs?.. > > > Den 30. apr. 2017 kl. 14.18 skrev Carlton Banks : > > The shapes are > > print len(train_input)print train_input[0].shapeprint train_output.shape > 33(100, 8, 45, 3)(100, 1, 145) > > > 100 is the batch-size.. > > Den 30. apr. 2017 kl. 12.57 skrev Joel Nothman : > > Scikit-learn should accept a list as X to grid search and index it just > fine. So I'm not sure that constraint applies to Grid Search > > On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente > wrote: > >> Tbh I've never tried, but I would say that te current sklearn API does >> not support multi-input data... >> >> El 30 abr 2017, a las 12:02, Joel Nothman >> escribi?: >> >> What are the shapes of train_input and train_output? >> >> On 30 April 2017 at 12:59, Carlton Banks wrote: >> >>> I am currently trying to run some gridsearchCV on a keras model which >>> has multiple inputs. >>> The inputs is stored in a list in which each entry in the list is a >>> input for a specific channel. >>> >>> >>> Here is my model and how i use the gridsearch. >>> >>> https://pastebin.com/GMKH1L80 >>> >>> The error i am getting is: >>> >>> https://pastebin.com/A3cB0rMv >>> >>> Any idea how i can resolve this? 
>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Apr 30 20:21:42 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 1 May 2017 10:21:42 +1000 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> Message-ID: Do each of your 33 inputs have a batch of size 100? If you reshape your data so that it all fits in one matrix, and then split it back out into its 33 components as the first transformation in a Pipeline, there should be no problem. On 1 May 2017 at 10:17, Joel Nothman wrote: > Sorry, I don't know enough about keras and its terminology. > > Scikit-learn usually limits itself to datasets where features and targets > are a rectangular matrix. > > But grid search and other model selection tools should allow data of other > shapes as long as they can be indexed on the first axis. You may be best > off, however, getting support from the Keras folks. > > On 30 April 2017 at 23:23, Carlton Banks wrote: > >> It seems like scikit-learn is not able to handle network with multiple >> inputs. >> Keras documentation states: >> >> You can use Sequential Keras models (*single-input only*) as part of >> your Scikit-Learn workflow via the wrappers found at >> keras.wrappers.scikit_learn.py. >> But besides what the wrapper can do.. can scikit-learn really not handle >> multiple inputs?.. >> >> >> Den 30. apr. 2017 kl. 14.18 skrev Carlton Banks : >> >> The shapes are >> >> print len(train_input)print train_input[0].shapeprint train_output.shape >> 33(100, 8, 45, 3)(100, 1, 145) >> >> >> 100 is the batch-size.. >> >> Den 30. apr. 2017 kl. 12.57 skrev Joel Nothman : >> >> Scikit-learn should accept a list as X to grid search and index it just >> fine. So I'm not sure that constraint applies to Grid Search >> >> On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente >> wrote: >> >>> Tbh I've never tried, but I would say that te current sklearn API does >>> not support multi-input data... >>> >>> El 30 abr 2017, a las 12:02, Joel Nothman >>> escribi?: >>> >>> What are the shapes of train_input and train_output? >>> >>> On 30 April 2017 at 12:59, Carlton Banks wrote: >>> >>>> I am currently trying to run some gridsearchCV on a keras model which >>>> has multiple inputs. >>>> The inputs is stored in a list in which each entry in the list is a >>>> input for a specific channel. >>>> >>>> >>>> Here is my model and how i use the gridsearch. 
>>>> >>>> https://pastebin.com/GMKH1L80 >>>> >>>> The error i am getting is: >>>> >>>> https://pastebin.com/A3cB0rMv >>>> >>>> Any idea how i can resolve this? >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sun Apr 30 21:45:31 2017 From: noflaco at gmail.com (Carlton Banks) Date: Mon, 1 May 2017 03:45:31 +0200 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> Message-ID: How ? batchsize could also be 1, I?ve just stored it like that. But how do reshape me data to be a matrix.. thats the big question.. is possible? > Den 1. maj 2017 kl. 02.21 skrev Joel Nothman : > > Do each of your 33 inputs have a batch of size 100? If you reshape your data so that it all fits in one matrix, and then split it back out into its 33 components as the first transformation in a Pipeline, there should be no problem. > > On 1 May 2017 at 10:17, Joel Nothman > wrote: > Sorry, I don't know enough about keras and its terminology. > > Scikit-learn usually limits itself to datasets where features and targets are a rectangular matrix. > > But grid search and other model selection tools should allow data of other shapes as long as they can be indexed on the first axis. You may be best off, however, getting support from the Keras folks. > > On 30 April 2017 at 23:23, Carlton Banks > wrote: > It seems like scikit-learn is not able to handle network with multiple inputs. > Keras documentation states: > > You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers found at keras.wrappers.scikit_learn.py . > > But besides what the wrapper can do.. can scikit-learn really not handle multiple inputs?.. > > >> Den 30. apr. 2017 kl. 14.18 skrev Carlton Banks >: >> >> The shapes are >> >> print len(train_input) >> print train_input[0].shape >> print train_output.shape >> >> 33 >> (100, 8, 45, 3) >> (100, 1, 145) >> >> 100 is the batch-size.. >>> Den 30. apr. 2017 kl. 12.57 skrev Joel Nothman >: >>> >>> Scikit-learn should accept a list as X to grid search and index it just fine. So I'm not sure that constraint applies to Grid Search >>> >>> On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente > wrote: >>> Tbh I've never tried, but I would say that te current sklearn API does not support multi-input data... >>> >>> El 30 abr 2017, a las 12:02, Joel Nothman > escribi?: >>> >>>> What are the shapes of train_input and train_output? 
>>>> >>>> On 30 April 2017 at 12:59, Carlton Banks > wrote: >>>> I am currently trying to run some gridsearchCV on a keras model which has multiple inputs. >>>> The inputs is stored in a list in which each entry in the list is a input for a specific channel. >>>> >>>> >>>> Here is my model and how i use the gridsearch. >>>> >>>> https://pastebin.com/GMKH1L80 >>>> >>>> The error i am getting is: >>>> >>>> https://pastebin.com/A3cB0rMv >>>> >>>> Any idea how i can resolve this? >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Apr 30 23:19:02 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 1 May 2017 13:19:02 +1000 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> Message-ID: Unless I'm mistaken about what we're looking at, you could use something like: class ToMultiInput(TransformerMixin, BaseEstimator): def fit(self, shapes): self.shapes = shapes def transform(self, X): return [X.] tmi = ToMultiInput([single.shape for single in train_input]) # this assumes that train_input is a sequence of ndarrays with the same first dimension: train_input = np.hstack([single.reshape(single.shape[0], -1) for single in train_input]) GridSearchCV(make_pipeline(tmi, my_predictor), ...) On 1 May 2017 at 11:45, Carlton Banks wrote: > How ? batchsize could also be 1, I?ve just stored it like that. > > But how do reshape me data to be a matrix.. thats the big question.. is > possible? > > Den 1. maj 2017 kl. 02.21 skrev Joel Nothman : > > Do each of your 33 inputs have a batch of size 100? If you reshape your > data so that it all fits in one matrix, and then split it back out into its > 33 components as the first transformation in a Pipeline, there should be no > problem. > > On 1 May 2017 at 10:17, Joel Nothman wrote: > >> Sorry, I don't know enough about keras and its terminology. >> >> Scikit-learn usually limits itself to datasets where features and targets >> are a rectangular matrix. >> >> But grid search and other model selection tools should allow data of >> other shapes as long as they can be indexed on the first axis. You may be >> best off, however, getting support from the Keras folks. 
>> >> On 30 April 2017 at 23:23, Carlton Banks wrote: >> >>> It seems like scikit-learn is not able to handle network with multiple >>> inputs. >>> Keras documentation states: >>> >>> You can use Sequential Keras models (*single-input only*) as part of >>> your Scikit-Learn workflow via the wrappers found at >>> keras.wrappers.scikit_learn.py. >>> But besides what the wrapper can do.. can scikit-learn really not handle >>> multiple inputs?.. >>> >>> >>> Den 30. apr. 2017 kl. 14.18 skrev Carlton Banks : >>> >>> The shapes are >>> >>> print len(train_input)print train_input[0].shapeprint train_output.shape >>> 33(100, 8, 45, 3)(100, 1, 145) >>> >>> >>> 100 is the batch-size.. >>> >>> Den 30. apr. 2017 kl. 12.57 skrev Joel Nothman : >>> >>> Scikit-learn should accept a list as X to grid search and index it just >>> fine. So I'm not sure that constraint applies to Grid Search >>> >>> On 30 April 2017 at 20:11, Julio Antonio Soto de Vicente >> > wrote: >>> >>>> Tbh I've never tried, but I would say that te current sklearn API does >>>> not support multi-input data... >>>> >>>> El 30 abr 2017, a las 12:02, Joel Nothman >>>> escribi?: >>>> >>>> What are the shapes of train_input and train_output? >>>> >>>> On 30 April 2017 at 12:59, Carlton Banks wrote: >>>> >>>>> I am currently trying to run some gridsearchCV on a keras model which >>>>> has multiple inputs. >>>>> The inputs is stored in a list in which each entry in the list is a >>>>> input for a specific channel. >>>>> >>>>> >>>>> Here is my model and how i use the gridsearch. >>>>> >>>>> https://pastebin.com/GMKH1L80 >>>>> >>>>> The error i am getting is: >>>>> >>>>> https://pastebin.com/A3cB0rMv >>>>> >>>>> Any idea how i can resolve this? >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sun Apr 30 23:22:00 2017 From: noflaco at gmail.com (Carlton Banks) Date: Mon, 1 May 2017 05:22:00 +0200 Subject: [scikit-learn] gridsearchCV able to handle list of input? In-Reply-To: References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es> <5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com> <4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com> Message-ID: <95983596-585F-40D8-95A3-05FCEEE56C15@gmail.com> hmm.. guess I can give it a try.. 
I'm currently optimizing with for loops..

On 1 May 2017 at 05:19, Joel Nothman wrote:
> Unless I'm mistaken about what we're looking at, you could use something like:
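A hand-rolled search of that sort might look like the following sketch, using sklearn's ParameterGrid to enumerate the grid. Note that build_model is a hypothetical factory returning a compiled multi-input Keras model, and a real run would score on held-out data rather than the training set:

from sklearn.model_selection import ParameterGrid

param_grid = {'lr': [0.01, 0.001], 'batch_size': [1, 32]}
best_score, best_params = float('inf'), None
for params in ParameterGrid(param_grid):
    # build_model is hypothetical; Keras fit/evaluate accept a list of
    # arrays for multi-input models, so no reshaping is needed here
    model = build_model(lr=params['lr'])
    model.fit(train_input, train_output, epochs=10,
              batch_size=params['batch_size'], verbose=0)
    # assumes the model was compiled with a single loss and no extra metrics,
    # so evaluate returns one scalar
    score = model.evaluate(train_input, train_output, verbose=0)
    if score < best_score:
        best_score, best_params = score, params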
From joel.nothman at gmail.com  Sun Apr 30 23:22:04 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 1 May 2017 13:22:04 +1000
Subject: [scikit-learn] gridsearchCV able to handle list of input?
In-Reply-To: 
References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es>
	<5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com>
	<4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com>
Message-ID: 

Sorry, I sent that incomplete (and this obviously remains untested):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ToMultiInput(TransformerMixin, BaseEstimator):
    def __init__(self, shapes):
        # per-input shapes, excluding the sample (first) axis
        self.shapes = shapes
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # number of flattened columns each input occupies
        sizes = [int(np.prod(shape)) for shape in self.shapes]
        offsets = np.cumsum([0] + sizes)
        # slice the column blocks back out and restore each input's shape
        return [X[:, start:stop].reshape((X.shape[0],) + tuple(shape))
                for start, stop, shape
                in zip(offsets, offsets[1:], self.shapes)]

tmi = ToMultiInput([single.shape[1:] for single in train_input])
train_input = np.hstack([single.reshape(single.shape[0], -1)
                         for single in train_input])

GridSearchCV(make_pipeline(tmi, my_predictor), ...)
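A quick round trip with the shapes quoted earlier in the thread (33 inputs of shape (100, 8, 45, 3)) is an easy sanity check that the flattening and the splitting in the sketch above are inverses:

import numpy as np

inputs = [np.random.rand(100, 8, 45, 3) for _ in range(33)]
tmi = ToMultiInput([a.shape[1:] for a in inputs])
X = np.hstack([a.reshape(a.shape[0], -1) for a in inputs])  # (100, 33*1080)
restored = tmi.transform(X)
assert all(np.array_equal(a, b) for a, b in zip(inputs, restored))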
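Wiring it into the search itself might then look like the sketch below. Here make_model is a hypothetical factory returning a compiled multi-input Keras model, and KerasRegressor is the wrapper the Keras docs describe as single-input only, so whether it passes the list that ToMultiInput emits through to the model untouched is exactly the untested part:

from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

reg = KerasRegressor(build_fn=make_model, epochs=10, verbose=0)
pipe = make_pipeline(tmi, reg)  # tmi as constructed above
grid = GridSearchCV(pipe, {'kerasregressor__batch_size': [1, 32]}, cv=3)
# GridSearchCV expects a 2d y, so flatten the (100, 1, 145) targets:
grid.fit(train_input, train_output.reshape(train_output.shape[0], -1))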
From noflaco at gmail.com  Sun Apr 30 23:23:26 2017
From: noflaco at gmail.com (Carlton Banks)
Date: Mon, 1 May 2017 05:23:26 +0200
Subject: [scikit-learn] gridsearchCV able to handle list of input?
In-Reply-To: 
References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es>
	<5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com>
	<4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com>
Message-ID: <4E896C39-BB8E-485E-94CE-5A71136DAF7E@gmail.com>

BaseEstimator being?

On 1 May 2017 at 05:22, Joel Nothman wrote:
> Sorry, I sent that incomplete (and this obviously remains untested):
From joel.nothman at gmail.com  Sun Apr 30 23:27:21 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 1 May 2017 13:27:21 +1000
Subject: [scikit-learn] gridsearchCV able to handle list of input?
In-Reply-To: <4E896C39-BB8E-485E-94CE-5A71136DAF7E@gmail.com>
References: <94BCAC95-9F07-4CC3-9D5A-D6C04176BF65@esbet.es>
	<5EF540A8-68B6-4402-B87C-EC6E1B60E01B@gmail.com>
	<4945250F-6B85-407B-A0F7-690DB054C28F@gmail.com>
	<4E896C39-BB8E-485E-94CE-5A71136DAF7E@gmail.com>
Message-ID: 

scikit-learn.org/stable/modules/classes.html

On 1 May 2017 at 13:23, Carlton Banks wrote:
> BaseEstimator being?
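Concretely, both mixins live in sklearn.base, and BaseEstimator is what gives the transformer get_params/set_params, which GridSearchCV needs in order to clone the pipeline. A minimal check, reusing the ToMultiInput sketch from earlier in the thread:

from sklearn.base import BaseEstimator, TransformerMixin

tmi = ToMultiInput([(8, 45, 3)] * 33)
# get_params is inherited from BaseEstimator and reads the __init__ signature:
print(tmi.get_params())  # {'shapes': [(8, 45, 3), ...]}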