From tevang3 at gmail.com Thu Dec 1 08:01:36 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Dec 2016 14:01:36 +0100
Subject: [scikit-learn] random forests using grouped data
Message-ID:

Greetings

I have grouped data which are divided into actives and inactives. The features are two different types of normalized scores (0-1), where the higher the score, the more probable an observation is to be "active". My data look like this:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [
y=[1,1,1,0,0,0, ...]

Group2:
score1 = [0
score2 = [
y=[1,1,1,1,1]

......
Group24:
score1 = [0
score2 = [
y=[1,1,1,1,1]

I searched in the documentation about the treatment of grouped data, but the only thing I found was how to do cross-validation. My question is whether there is any special algorithm that creates random forests from this type of grouped data.

thanks in advance
Thomas

--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com Thu Dec 1 08:05:45 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Dec 2016 14:05:45 +0100
Subject: [scikit-learn] random forests using grouped data
In-Reply-To:
References:
Message-ID:

Sorry, the previous email was incomplete. Below is what the grouped data look like:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0.14, ...]
y=[1,1,1,0,0,0, ...]   # 1 indicates "active" and 0 "inactive"

Group2:
score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0.1, ...]
y=[1,1,1,0,0,0, ...]   # 1 indicates "active" and 0 "inactive"

......
Group24:
score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0.22, ...]
y=[1,1,1,0,0,0, ...]   # 1 indicates "active" and 0 "inactive"

On 1 December 2016 at 14:01, Thomas Evangelidis wrote:

> Greetings
>
> I have grouped data which are divided into actives and inactives. The
> features are two different types of normalized scores (0-1), where the
> higher the score, the more probable an observation is to be "active". My
> data look like this:
>
> Group1:
> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
> score2 = [
> y=[1,1,1,0,0,0, ...]
>
> Group2:
> score1 = [0
> score2 = [
> y=[1,1,1,1,1]
>
> ......
> Group24:
> score1 = [0
> score2 = [
> y=[1,1,1,1,1]
>
> I searched in the documentation about the treatment of grouped data, but the
> only thing I found was how to do cross-validation. My question is whether
> there is any special algorithm that creates random forests from this type
> of grouped data.
> > thanks in advance > Thomas > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Thu Dec 1 08:16:54 2016 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 1 Dec 2016 22:16:54 +0900 Subject: [scikit-learn] random forests using grouped data In-Reply-To: References: Message-ID: Hello Thomas, I don't personally know of any algorithm that works on collections of groupings, but why not first test a simple control model, meaning can you achieve a satisfactory model by simply concatenating all 48 scores per sample and building a forest the standard way? If not, what context or reasons dictate that the groupings need to stay retained as you have presented them? Hope this helps, J.B. 2016-12-01 22:05 GMT+09:00 Thomas Evangelidis : > Sorry, the previous email was incomplete. Below is how the grouped data > look like: > > > Group1: > score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...] > score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0,14, ...] > y=[1,1,1,0,0,0, ...] # 1 indicates "active" and 0 "inactive" > > Group2: > score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...] > score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0,1, ...] > y=[1,1,1,0,0,0, ...] # 1 indicates "active" and 0 "inactive" > > ?...... > Group24?: > score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...] > score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0,22, ...] > y=[1,1,1,0,0,0, ...] # 1 indicates "active" and 0 "inactive" > > > On 1 December 2016 at 14:01, Thomas Evangelidis wrote: > >> Greetings >> >> ?I have grouped data which are divided into actives and inactives. The >> features are two different types of normalized scores (0-1), where the >> higher the score the most probable is an observation to be an "active". My >> data look like this: >> >> >> Group1: >> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...] >> score2 = [ >> y=[1,1,1,0,0,0, ...] >> >> Group2: >> ?score1 = [0 >> score2 = [ >> y=[1,1,1,1,1]? >> >> ?...... >> Group24?: >> ?score1 = [0 >> score2 = [ >> y=[1,1,1,1,1]? >> >> >> I searched in the documentation about treatment of grouped data, but the >> only thing I found was how do do cross-validation. My question is whether >> there is any special algorithm that creates random forests from these type >> of grouped data. 
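A minimal sketch of the concatenation baseline suggested here: stack score1 and score2 into one feature matrix, keep a group id per sample, and let GroupKFold hold out whole groups at validation time. The arrays below are tiny placeholders standing in for the real per-group lists, and roc_auc is just one possible scorer.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Placeholder data: each row is [score1, score2] for one observation.
X = np.array([[0.56, 0.34], [0.34, 0.27], [0.12, 0.05],   # Group 1
              [0.34, 0.28], [0.48, 0.34], [0.18, 0.13]])  # Group 2
y = np.array([1, 1, 0, 1, 1, 0])
groups = np.array([1, 1, 1, 2, 2, 2])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = GroupKFold(n_splits=2)   # with 24 groups, n_splits could be as large as 24
scores = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring='roc_auc')
print(scores)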
>> >> thanks in advance >> Thomas >> >> >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Thu Dec 1 09:04:55 2016 From: zephyr14 at gmail.com (Vlad Niculae) Date: Thu, 1 Dec 2016 09:04:55 -0500 Subject: [scikit-learn] random forests using grouped data In-Reply-To: References: Message-ID: I don't think there are any such estimators in scikit-learn directly, but the model selection machinery is there to help. Check out GroupKFold [1] so you can do cross-validation after concatenating all the samples, while ensuring that training and validation groups are separate. The setup of this problem looks a lot like query results reranking in information retrieval, where you need to find relevant and non-relevant results among the set of retrieved docs for each search query. A simple approach you can build using scikit-learn tools is RankSVM, where you take, within each group, all possible pairs between a positive and a negative sample, and take the difference of their features as your input. This is the same as optimizing within-group AUC. Unfortunately the trick doesn't work in the same way for nonlinear models, but it's another baseline you could try. Fabian had an example of this, with some VERY enlightening illustrations, here [2]. HTH, Vlad [1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html [2] https://github.com/fabianp/minirank/blob/master/notebooks/pairwise_transform.ipynb On Thu, Dec 1, 2016 at 8:16 AM, Brown J.B. wrote: > Hello Thomas, > > I don't personally know of any algorithm that works on collections of > groupings, but why not first test a simple control model, meaning > can you achieve a satisfactory model by simply concatenating all 48 scores > per sample and building a forest the standard way? > If not, what context or reasons dictate that the groupings need to stay > retained as you have presented them? > > Hope this helps, > J.B. > > 2016-12-01 22:05 GMT+09:00 Thomas Evangelidis : >> >> Sorry, the previous email was incomplete. Below is how the grouped data >> look like: >> >> >> Group1: >> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...] >> score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0,14, ...] >> y=[1,1,1,0,0,0, ...] # 1 indicates "active" and 0 "inactive" >> >> Group2: >> score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...] >> score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0,1, ...] >> y=[1,1,1,0,0,0, ...] # 1 indicates "active" and 0 "inactive" >> >> ...... >> Group24: >> score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...] 
>> score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0,22, ...] >> y=[1,1,1,0,0,0, ...] # 1 indicates "active" and 0 "inactive" >> >> >> On 1 December 2016 at 14:01, Thomas Evangelidis wrote: >>> >>> Greetings >>> >>> I have grouped data which are divided into actives and inactives. The >>> features are two different types of normalized scores (0-1), where the >>> higher the score the most probable is an observation to be an "active". My >>> data look like this: >>> >>> >>> Group1: >>> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...] >>> score2 = [ >>> y=[1,1,1,0,0,0, ...] >>> >>> Group2: >>> score1 = [0 >>> score2 = [ >>> y=[1,1,1,1,1] >>> >>> ...... >>> Group24: >>> score1 = [0 >>> score2 = [ >>> y=[1,1,1,1,1] >>> >>> >>> I searched in the documentation about treatment of grouped data, but the >>> only thing I found was how do do cross-validation. My question is whether >>> there is any special algorithm that creates random forests from these type >>> of grouped data. >>> >>> thanks in advance >>> Thomas >>> >>> >>> >>> -- >>> >>> ====================================================================== >>> >>> Thomas Evangelidis >>> >>> Research Specialist >>> >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/1S081, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >> >> >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From deliprao at gmail.com Thu Dec 1 22:33:33 2016 From: deliprao at gmail.com (Delip Rao) Date: Fri, 02 Dec 2016 03:33:33 +0000 Subject: [scikit-learn] [semi-supervised learning] Using a pre-existing graph with LabelSpreading API Message-ID: Hello, I have an existing graph dataset in the edge format: node_i node_j weight The number of nodes are around 3.6M, and the number of edges are around 72M. I also have some labeled data (around a dozen per class with 16 classes in total), so overall, a perfect setting for label propagation or its variants. In particular, I want to try the LabelSpreading implementation for the regularization. I looked at the documentation and can't find a way to plug in a pre-computed graph (or adjacency matrix). So two questions: 1. What are any scaling issues I should be aware of for a dataset of this size? I can try sparsifying the graph, but would love to learn any knobs I should be aware of. 2. How do I plugin an existing weighted graph with the current API? Happy to use any undocumented features. Thanks in advance! Delip -------------- next part -------------- An HTML attachment was scrubbed... 
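For plugging in a graph that already exists, one workaround that needs no undocumented scikit-learn features is to run the label-spreading iteration of Zhou et al. directly on the sparse adjacency matrix with scipy. This is only a sketch under stated assumptions: the edge-list file name, the alpha value, the number of iterations and the seed labels below are all placeholders.

import numpy as np
import scipy.sparse as sp

# Load the weighted edge list (node_i node_j weight); 'edges.txt' is a placeholder name.
rows, cols, weights = np.loadtxt('edges.txt', unpack=True)
n = int(max(rows.max(), cols.max())) + 1
W = sp.coo_matrix((weights, (rows.astype(int), cols.astype(int))), shape=(n, n)).tocsr()
W = W.maximum(W.T)                                   # symmetrize

# Symmetric normalization S = D^-1/2 W D^-1/2
d = np.asarray(W.sum(axis=1)).ravel()
d[d == 0] = 1.0
S = sp.diags(1.0 / np.sqrt(d)).dot(W).dot(sp.diags(1.0 / np.sqrt(d)))
S = S.astype(np.float32)

n_classes = 16
labels = np.full(n, -1)                              # -1 marks unlabeled nodes
labels[[0, 1, 2]] = [0, 1, 2]                        # hypothetical seed nodes and classes

Y = np.zeros((n, n_classes), dtype=np.float32)       # float32 keeps 3.6M x 16 manageable
Y[labels >= 0, labels[labels >= 0]] = 1.0

alpha, F = 0.9, Y.copy()
for _ in range(30):                                  # F <- alpha*S*F + (1-alpha)*Y
    F = alpha * S.dot(F) + (1 - alpha) * Y
predicted = F.argmax(axis=1)

At 3.6M nodes and 72M edges the sparse matrix and the dense F array should stay within a few gigabytes, so sparsifying the graph mainly matters if you later switch to the built-in kernels, which try to construct the graph themselves.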
URL: From t3kcit at gmail.com Fri Dec 2 19:52:09 2016 From: t3kcit at gmail.com (Andy) Date: Fri, 2 Dec 2016 19:52:09 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: So did we ever decide on how to prioritize reviews? (I was still mentally / notification catching up after 0.18.1) There are some really important issues to tackle, often with proposed solutions, not no reviews! It's hard for everybody to keep the big picture in mind with such a full issue tracker. I think it might be helpful if Joel and me prioritize issues. Obviously that will only make sense if the other team members check up on it when deciding what to review / work on. Do we want to try to seriously use the project feature? https://github.com/scikit-learn/scikit-learn/projects/5 On my monitor I can fit four columns and the "add cards" tab. I tried using five columns (separating in-progress and stalled PRs) but then I could access the right-most column when the "add cards" was open. The whole interface is a bit awkward but maybe the best we have (for example moving something from the bottom to the top is easiest by moving it to a different column, then scrolling up, then moving it back) wdyt? Andy On 09/29/2016 11:05 PM, Joel Nothman wrote: > The spreadsheet seems to have some duplications and presumably some > missing rows, with apologies. I assume some is due to the github > pagination, and some may be my error. Not a big enough error to fix up. > > On 30 September 2016 at 05:15, Raphael C > wrote: > > My apologies I see it is in the spreadsheet. It would be great to see > this work finished for 0.19 if at all possible IMHO. > > Raphael > > On 29 September 2016 at 20:12, Raphael C > wrote: > > I hope this isn't out of place but I notice that > > https://github.com/scikit-learn/scikit-learn/pull/4899 > is not in the > > list. It seems like a very worthwhile addition and the PR appears > > stalled at present. > > > > Raphael > > > > On 29 September 2016 at 15:05, Joel Nothman > > wrote: > >> I agree that being able to identify which PRs are stalled on the > >> contributor's part, which on reviewers' part, and since when, > would be > >> great. I'm not sure we've come up with a way that'll work though. > >> > >> In terms of backlog, I've wondered if just getting things into > a spreadsheet > >> would help: > >> > >> > https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit > > >> > >> What other features of an Issue / PR would be useful to > >> sort/filter/pivottable on in a spreadsheet form like this? > >> > >> (It would be extra nice if we could modify titles and labels > within the > >> spreadsheet and have them update via the GitHub API, but I'm > not sure I'll > >> get around to making that feature :P) > >> > >> > >> On 29 September 2016 at 23:45, Andreas Mueller > > wrote: > >>> > >>> So I made a project for 0.19: > >>> > >>> https://github.com/scikit-learn/scikit-learn/projects/5 > > >>> > >>> The idea would be to drag and drop issues and PRs so that the > important > >>> ones are at the top. > >>> We could also add an "important" column, currently the > scrolling is pretty > >>> annoying. > >>> Thoughts? 
> >>> > >>> > >>> > >>> > >>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote: > >>>> > >>>> On 28 September 2016 at 12:24, Andreas Mueller > > wrote: > >>>>> > >>>>> > >>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote: > >>>>>> > >>>>>> > >>>>>> I think the only ones worth having are the ones that can be > dealt with > >>>>>> automatically and the ones that will not be used frequently: > >>>>>> > >>>>>> - stalled after 30 days of inactivity [can be done > automatically] > >>>>>> - in dispute [I don't expect it to be used often]. > >>>>> > >>>>> I think "in dispute" is actually one of the most common > statuses among > >>>>> PRs. > >>>>> Or maybe I have a skewed picture of things. > >>>>> Many PRs stalled because it is not clear whether the > proposed solution > >>>>> is a > >>>>> good one. > >>>> > >>>> On the stalled one, sure, but there are a lot of PRs being merged > >>>> fairly quickly. So over all, I think it is quite rare. No? > >>>> > >>>>> It would be great to have some way to get through the > backlog of 400 PRs > >>>>> and > >>>>> I think tagging them might be useful. > >>>>> We rarely reject PRs, we could also revisit that policy. > >>>>> > >>>>> For the backlog, it's pretty unclear to me how many are > waiting for > >>>>> reviews, > >>>>> how many are waiting for changes, > >>>>> and how many are disputed. > >>>>> Tagging these might help people who want to review to find > things to > >>>>> review, > >>>>> and people who want to code to pick > >>>>> up stalled PRs. > >>>> > >>>> That sounds like a great use of labels, thought all of these > need to > >>>> be tagged manually. > >>>> > >>>>> _______________________________________________ > >>>>> scikit-learn mailing list > >>>>> scikit-learn at python.org > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>>> > >>>> _______________________________________________ > >>>> scikit-learn mailing list > >>>> scikit-learn at python.org > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>> > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Fri Dec 2 20:04:10 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Fri, 2 Dec 2016 17:04:10 -0800 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: Hello, This seems a good moment to say that we will be starting a project at BIDS next semester to try extract information from github and classify PRs into different categories (stalled, updated, needs review). St?fan drafted a list of elements he would like to see for scikit-image, and I have been wanting something similar for matplotlib. 
I've got my hands full right now, but we are more than open to discuss with the wider community to see if such a tool would be useful and what features is of interest. Here are some examples of elements we'd like to be able to identify and sort: - Most active pull requests ?hot topics? - The one where "I" have commented on. - PRs that haven?t seen any discussion. - Stalled PRs. - New issues without any comments. - See the old PRs that could be merged - Recently merged PR referring to a ticket but haven?t closed that ticket. - Duplicate PR (closing the same ticket). - Tickets that being referred to many times. - Unmergeable PRs (that need to be rebased). - PRs that passed the majority of tests. - Issues that external projects refer too. Do you think something like this could be interesting for sklearn? Also, if you have scripts that similar things and that you would be willing to share, we would be very happy to see what exists already out there. Cheers, N On 2 December 2016 at 16:52, Andy wrote: > So did we ever decide on how to prioritize reviews? > (I was still mentally / notification catching up after 0.18.1) > > There are some really important issues to tackle, often with proposed > solutions, not no reviews! > It's hard for everybody to keep the big picture in mind with such a full > issue tracker. > I think it might be helpful if Joel and me prioritize issues. Obviously that > will only make > sense if the other team members check up on it when deciding what to review > / work on. > > Do we want to try to seriously use the project feature? > https://github.com/scikit-learn/scikit-learn/projects/5 > > On my monitor I can fit four columns and the "add cards" tab. > I tried using five columns (separating in-progress and stalled PRs) but then > I could access the right-most column when > the "add cards" was open. > The whole interface is a bit awkward but maybe the best we have (for example > moving something from the bottom > to the top is easiest by moving it to a different column, then scrolling up, > then moving it back) > > wdyt? > Andy > > > > On 09/29/2016 11:05 PM, Joel Nothman wrote: > > The spreadsheet seems to have some duplications and presumably some missing > rows, with apologies. I assume some is due to the github pagination, and > some may be my error. Not a big enough error to fix up. > > On 30 September 2016 at 05:15, Raphael C wrote: >> >> My apologies I see it is in the spreadsheet. It would be great to see >> this work finished for 0.19 if at all possible IMHO. >> >> Raphael >> >> On 29 September 2016 at 20:12, Raphael C wrote: >> > I hope this isn't out of place but I notice that >> > https://github.com/scikit-learn/scikit-learn/pull/4899 is not in the >> > list. It seems like a very worthwhile addition and the PR appears >> > stalled at present. >> > >> > Raphael >> > >> > On 29 September 2016 at 15:05, Joel Nothman >> > wrote: >> >> I agree that being able to identify which PRs are stalled on the >> >> contributor's part, which on reviewers' part, and since when, would be >> >> great. I'm not sure we've come up with a way that'll work though. >> >> >> >> In terms of backlog, I've wondered if just getting things into a >> >> spreadsheet >> >> would help: >> >> >> >> >> >> https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit >> >> >> >> What other features of an Issue / PR would be useful to >> >> sort/filter/pivottable on in a spreadsheet form like this? 
>> >> >> >> (It would be extra nice if we could modify titles and labels within the >> >> spreadsheet and have them update via the GitHub API, but I'm not sure >> >> I'll >> >> get around to making that feature :P) >> >> >> >> >> >> On 29 September 2016 at 23:45, Andreas Mueller >> >> wrote: >> >>> >> >>> So I made a project for 0.19: >> >>> >> >>> https://github.com/scikit-learn/scikit-learn/projects/5 >> >>> >> >>> The idea would be to drag and drop issues and PRs so that the >> >>> important >> >>> ones are at the top. >> >>> We could also add an "important" column, currently the scrolling is >> >>> pretty >> >>> annoying. >> >>> Thoughts? >> >>> >> >>> >> >>> >> >>> >> >>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote: >> >>>> >> >>>> On 28 September 2016 at 12:24, Andreas Mueller >> >>>> wrote: >> >>>>> >> >>>>> >> >>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote: >> >>>>>> >> >>>>>> >> >>>>>> I think the only ones worth having are the ones that can be dealt >> >>>>>> with >> >>>>>> automatically and the ones that will not be used frequently: >> >>>>>> >> >>>>>> - stalled after 30 days of inactivity [can be done automatically] >> >>>>>> - in dispute [I don't expect it to be used often]. >> >>>>> >> >>>>> I think "in dispute" is actually one of the most common statuses >> >>>>> among >> >>>>> PRs. >> >>>>> Or maybe I have a skewed picture of things. >> >>>>> Many PRs stalled because it is not clear whether the proposed >> >>>>> solution >> >>>>> is a >> >>>>> good one. >> >>>> >> >>>> On the stalled one, sure, but there are a lot of PRs being merged >> >>>> fairly quickly. So over all, I think it is quite rare. No? >> >>>> >> >>>>> It would be great to have some way to get through the backlog of 400 >> >>>>> PRs >> >>>>> and >> >>>>> I think tagging them might be useful. >> >>>>> We rarely reject PRs, we could also revisit that policy. >> >>>>> >> >>>>> For the backlog, it's pretty unclear to me how many are waiting for >> >>>>> reviews, >> >>>>> how many are waiting for changes, >> >>>>> and how many are disputed. >> >>>>> Tagging these might help people who want to review to find things to >> >>>>> review, >> >>>>> and people who want to code to pick >> >>>>> up stalled PRs. >> >>>> >> >>>> That sounds like a great use of labels, thought all of these need to >> >>>> be tagged manually. 
>> >>>> >> >>>>> _______________________________________________ >> >>>>> scikit-learn mailing list >> >>>>> scikit-learn at python.org >> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>>> >> >>>> _______________________________________________ >> >>>> scikit-learn mailing list >> >>>> scikit-learn at python.org >> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>> >> >>> >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From t3kcit at gmail.com Fri Dec 2 21:28:14 2016 From: t3kcit at gmail.com (Andy) Date: Fri, 2 Dec 2016 21:28:14 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: Hey Nelle. That sounds great. My main question is how you'd expose this to the user. Will it be a separate website? A bot? Emails? Greasemonkey on top of github? Most of these could be implemented with tags that are automatically assigned by a bot, I guess. That would be quite a few tags, though, and wouldn't work well for filtering the ones I was active in. Tickets that are being referred to many times also sound more like a sorting of issues, not a tag. And some of these are more of a "notification type", like "this project has referred to this issue" is maybe something that I want to be made aware of, say by a comment on the issue (which triggers an email) or a direct email to me. Similarly I might be notified if someone forgot to close the ticket for a PR (so I can go and check whether to close it). I might want to be notified if any of my PRs become "unmergable". A comment by a bot would alert everybody though, and an email to me only me. The "PRs that haven't seen any discussion" is actually implemented in github by sorting by comments, and I recently used that. Also happy to (try to find time to) contribute code or discuss the project with you guys! To summarize, I think there are some low-hanging fruit for automatic tagging and for sending emails with notifications, and possibly for bots commenting. I expect that doing anything that involves sorting (a subset of) issues probably requires much more effort. Andy On 12/02/2016 08:04 PM, Nelle Varoquaux wrote: > Hello, > > This seems a good moment to say that we will be starting a project at > BIDS next semester to try extract information from github and classify > PRs into different categories (stalled, updated, needs review). > St?fan drafted a list of elements he would like to see for > scikit-image, and I have been wanting something similar for > matplotlib. 
> I've got my hands full right now, but we are more than open to discuss > with the wider community to see if such a tool would be useful and > what features is of interest. > > Here are some examples of elements we'd like to be able to identify and sort: > > - Most active pull requests ?hot topics? > - The one where "I" have commented on. > - PRs that haven?t seen any discussion. > - Stalled PRs. > - New issues without any comments. > - See the old PRs that could be merged > - Recently merged PR referring to a ticket but haven?t closed that ticket. > - Duplicate PR (closing the same ticket). > - Tickets that being referred to many times. > - Unmergeable PRs (that need to be rebased). > - PRs that passed the majority of tests. > - Issues that external projects refer too. > > Do you think something like this could be interesting for sklearn? > Also, if you have scripts that similar things and that you would be > willing to share, we would be very happy to see what exists already > out there. > > Cheers, > N > > On 2 December 2016 at 16:52, Andy wrote: >> So did we ever decide on how to prioritize reviews? >> (I was still mentally / notification catching up after 0.18.1) >> >> There are some really important issues to tackle, often with proposed >> solutions, not no reviews! >> It's hard for everybody to keep the big picture in mind with such a full >> issue tracker. >> I think it might be helpful if Joel and me prioritize issues. Obviously that >> will only make >> sense if the other team members check up on it when deciding what to review >> / work on. >> >> Do we want to try to seriously use the project feature? >> https://github.com/scikit-learn/scikit-learn/projects/5 >> >> On my monitor I can fit four columns and the "add cards" tab. >> I tried using five columns (separating in-progress and stalled PRs) but then >> I could access the right-most column when >> the "add cards" was open. >> The whole interface is a bit awkward but maybe the best we have (for example >> moving something from the bottom >> to the top is easiest by moving it to a different column, then scrolling up, >> then moving it back) >> >> wdyt? >> Andy >> >> >> >> On 09/29/2016 11:05 PM, Joel Nothman wrote: >> >> The spreadsheet seems to have some duplications and presumably some missing >> rows, with apologies. I assume some is due to the github pagination, and >> some may be my error. Not a big enough error to fix up. >> >> On 30 September 2016 at 05:15, Raphael C wrote: >>> My apologies I see it is in the spreadsheet. It would be great to see >>> this work finished for 0.19 if at all possible IMHO. >>> >>> Raphael >>> >>> On 29 September 2016 at 20:12, Raphael C wrote: >>>> I hope this isn't out of place but I notice that >>>> https://github.com/scikit-learn/scikit-learn/pull/4899 is not in the >>>> list. It seems like a very worthwhile addition and the PR appears >>>> stalled at present. >>>> >>>> Raphael >>>> >>>> On 29 September 2016 at 15:05, Joel Nothman >>>> wrote: >>>>> I agree that being able to identify which PRs are stalled on the >>>>> contributor's part, which on reviewers' part, and since when, would be >>>>> great. I'm not sure we've come up with a way that'll work though. 
>>>>> >>>>> In terms of backlog, I've wondered if just getting things into a >>>>> spreadsheet >>>>> would help: >>>>> >>>>> >>>>> https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit >>>>> >>>>> What other features of an Issue / PR would be useful to >>>>> sort/filter/pivottable on in a spreadsheet form like this? >>>>> >>>>> (It would be extra nice if we could modify titles and labels within the >>>>> spreadsheet and have them update via the GitHub API, but I'm not sure >>>>> I'll >>>>> get around to making that feature :P) >>>>> >>>>> >>>>> On 29 September 2016 at 23:45, Andreas Mueller >>>>> wrote: >>>>>> So I made a project for 0.19: >>>>>> >>>>>> https://github.com/scikit-learn/scikit-learn/projects/5 >>>>>> >>>>>> The idea would be to drag and drop issues and PRs so that the >>>>>> important >>>>>> ones are at the top. >>>>>> We could also add an "important" column, currently the scrolling is >>>>>> pretty >>>>>> annoying. >>>>>> Thoughts? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote: >>>>>>> On 28 September 2016 at 12:24, Andreas Mueller >>>>>>> wrote: >>>>>>>> >>>>>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote: >>>>>>>>> >>>>>>>>> I think the only ones worth having are the ones that can be dealt >>>>>>>>> with >>>>>>>>> automatically and the ones that will not be used frequently: >>>>>>>>> >>>>>>>>> - stalled after 30 days of inactivity [can be done automatically] >>>>>>>>> - in dispute [I don't expect it to be used often]. >>>>>>>> I think "in dispute" is actually one of the most common statuses >>>>>>>> among >>>>>>>> PRs. >>>>>>>> Or maybe I have a skewed picture of things. >>>>>>>> Many PRs stalled because it is not clear whether the proposed >>>>>>>> solution >>>>>>>> is a >>>>>>>> good one. >>>>>>> On the stalled one, sure, but there are a lot of PRs being merged >>>>>>> fairly quickly. So over all, I think it is quite rare. No? >>>>>>> >>>>>>>> It would be great to have some way to get through the backlog of 400 >>>>>>>> PRs >>>>>>>> and >>>>>>>> I think tagging them might be useful. >>>>>>>> We rarely reject PRs, we could also revisit that policy. >>>>>>>> >>>>>>>> For the backlog, it's pretty unclear to me how many are waiting for >>>>>>>> reviews, >>>>>>>> how many are waiting for changes, >>>>>>>> and how many are disputed. >>>>>>>> Tagging these might help people who want to review to find things to >>>>>>>> review, >>>>>>>> and people who want to code to pick >>>>>>>> up stalled PRs. >>>>>>> That sounds like a great use of labels, thought all of these need to >>>>>>> be tagged manually. 
>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Fri Dec 2 21:34:39 2016 From: t3kcit at gmail.com (Andy) Date: Fri, 2 Dec 2016 21:34:39 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: <27deb6f1-03fe-acf5-c549-67fbc1b2f7d1@gmail.com> Another fun shortcoming of the project interface: If a card is already present in your project, you can not search for it (though you can ctrl+f) From matteo at mycarta.ca Fri Dec 2 22:28:38 2016 From: matteo at mycarta.ca (Matteo Niccoli) Date: Fri, 2 Dec 2016 22:28:38 -0500 Subject: [scikit-learn] Trying to get learning curves with custom scorer and leave one group out Message-ID: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca> HI all, I want to plot learning curves on a trained SVM classifier, using a custom scorer, and using Leave One Group Out as the method of crossvalidation. I thought I had it figured out, but two different scorers - 'f1_micro' and 'accuracy' - will yield identical values. I am confused, is that supposed to be the case? 
Here's my code (unfortunately I cannot share the data as it is not open):

import numpy as np
import pandas as pd
from sklearn import svm, preprocessing
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import LeaveOneGroupOut, validation_curve

SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma=0.01,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)

training_data = pd.read_csv('training_data.csv')
# X is assumed to hold the feature matrix extracted from training_data
# (that step is not shown here)
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Targets'].values
groups = training_data["Groups"].values

Fscorer = make_scorer(f1_score, average='micro')
logo = LeaveOneGroupOut()
parm_range0 = np.logspace(-2, 6, 9)
train_scores0, test_scores0 = validation_curve(SVC_classifier_LOWO_VC0, X, y,
    "C", parm_range0, cv=logo.split(X, y, groups=groups), scoring=Fscorer)

Now, from:

train_scores_mean0 = np.mean(train_scores0, axis=1)
train_scores_std0 = np.std(train_scores0, axis=1)
test_scores_mean0 = np.mean(test_scores0, axis=1)
test_scores_std0 = np.std(test_scores0, axis=1)
print test_scores_mean0
print np.amax(test_scores_mean0)
print np.logspace(-2, 6, 9)[test_scores_mean0.argmax(axis=0)]

I get:

[ 0.20257407  0.35551122  0.40791047  0.49887676  0.5021742  0.50030438  0.49426622  0.48066419  0.4868987 ]
0.502174200206
100.0

If I create a new classifier, but with the same parameters, and run everything exactly as before, except for the scoring, e.g.:

parm_range1 = np.logspace(-2, 6, 9)
train_scores1, test_scores1 = validation_curve(SVC_classifier_LOWO_VC1, X, y,
    "C", parm_range1, cv=logo.split(X, y, groups=wells), scoring='accuracy')

train_scores_mean1 = np.mean(train_scores1, axis=1)
train_scores_std1 = np.std(train_scores1, axis=1)
test_scores_mean1 = np.mean(test_scores1, axis=1)
test_scores_std1 = np.std(test_scores1, axis=1)
print test_scores_mean1
print np.amax(test_scores_mean1)
print np.logspace(-2, 6, 9)[test_scores_mean1.argmax(axis=0)]

I get exactly the same answer:

[ 0.20257407  0.35551122  0.40791047  0.49887676  0.5021742  0.50030438  0.49426622  0.48066419  0.4868987 ]
0.502174200206
100.0

How is that possible, am I doing something wrong, or missing something?

Thanks

From matteo at mycarta.ca Fri Dec 2 22:40:05 2016
From: matteo at mycarta.ca (Matteo Niccoli)
Date: Fri, 2 Dec 2016 22:40:05 -0500
Subject: [scikit-learn] Trying to get learning curves with custom scorer and leave one group out
In-Reply-To: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca>
References: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca>
Message-ID: <8feb0c3aa67fc63a3754d266e053f2e6.squirrel@mycarta.ca>

My apologies, there was a typo in the code below, second example, should read:

train_scores1, test_scores1 = validation_curve(SVC_classifier_LOWO_VC1, X, y,
    "C", parm_range1, cv=logo.split(X, y, groups=groups), scoring='accuracy')

Everything else is correct.

On Fri, December 2, 2016 10:28 pm, Matteo Niccoli wrote:
> HI all,
>
> I want to plot learning curves on a trained SVM classifier, using a
> custom scorer, and using Leave One Group Out as the method of
> crossvalidation. I thought I had it figured out, but two different scorers
> - 'f1_micro' and 'accuracy' - will yield identical values. I am confused,
> is that supposed to be the case?
> > Here's my code (unfortunately I cannot share the data as it is not open): > > > from sklearn import svm SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, > class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, > gamma=0.01, kernel='rbf', max_iter=-1, probability=False, random_state=1, > shrinking=True, tol=0.001, verbose=False) training_data = > pd.read_csv('training_data.csv') scaler = > preprocessing.StandardScaler().fit(X) X = scaler.transform(X) > y = training_data['Targets'].values groups = training_data["Groups"].values > Fscorer = make_scorer(f1_score, average = 'micro') > logo = LeaveOneGroupOut() parm_range0 = np.logspace(-2, 6, 9) train_scores0, > test_scores0 = validation_curve(SVC_classifier_LOWO_VC0, X, y, "C", > parm_range0, cv =logo.split(X, y, groups=groups), scoring = Fscorer) > > > Now, from: > train_scores_mean0 = np.mean(train_scores0, axis=1) train_scores_std0 = > np.std(train_scores0, axis=1) test_scores_mean0 = np.mean(test_scores0, > axis=1) test_scores_std0 = np.std(test_scores0, axis=1) print > test_scores_mean0 print np.amax(test_scores_mean0) print np.logspace(-2, > 6, 9)[test_scores_mean0.argmax(axis=0)] > > > I get: > [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438 > 0.49426622 0.48066419 0.4868987 ] > 0.502174200206 > 100.0 > > > If I create a new classifier, but with the same parameters, and run > everything exactly as before, except for the scoring, e.g.: > > parm_range1 = np.logspace(-2, 6, 9) train_scores1, test_scores1 = > validation_curve(SVC_classifier_LOWO_VC1, X, y, "C", parm_range1, cv > =logo.split(X, y, groups=wells), scoring = > 'accuracy') > train_scores_mean1 = np.mean(train_scores1, axis=1) train_scores_std1= > np.std(train_scores1, axis=1) test_scores_mean1 = np.mean(test_scores1, > axis=1) test_scores_std1 = np.std(test_scores1, axis=1) print > test_scores_mean1 print np.amax(test_scores_mean1) print np.logspace(-2, > 6, 9)[test_scores_mean1.argmax(axis=0)] > > > I get exactly the same answer: > [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438 > 0.49426622 0.48066419 0.4868987 ] > 0.502174200206 > 100.0 > > > How is that possible, am I doing something wrong, or missing something? > > > Thanks > > > From alekhka at gmail.com Sat Dec 3 04:38:00 2016 From: alekhka at gmail.com (Alekh Karkada Ashok) Date: Sat, 3 Dec 2016 15:08:00 +0530 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: Message-ID: Hi all, I want use the Scikit-learn's MLPRegressor to map image to image. That is I have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB) pixels). Corresponding labelled images I have. Therefore Y is also [1000,230400]. But according to documentation: *fit(X, y)* Fit the model to data matrix X and target y. *Parameters:* *X : *{array-like, sparse matrix}, shape (n_samples, n_features) The input data. *y : *array-like, shape (n_samples,) The target values. *Returns:* self : returns a trained MLP model. We can see that Y should be a column matrix. Does this mean Scikit-learn doesn't support multiple outputs? I am getting MemoryError when I try to fit now. More: http://stackoverflow.com/questions/40945791/ memoryerror-in-scikit-learn Please help. Thanks! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Sat Dec 3 05:29:26 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 3 Dec 2016 11:29:26 +0100 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: Message-ID: <20161203102926.GG455403@phare.normalesup.org> On Sat, Dec 03, 2016 at 03:08:00PM +0530, Alekh Karkada Ashok wrote: > I want use the Scikit-learn's MLPRegressor to map image to image. That is I > have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB) pixels). > Corresponding labelled images I have. Therefore Y is also [1000,230400]. But > according to documentation: 1 thousands samples and 2030 thousands features: you are using the wrong tool, I multi-layer perceptron model will be too complex and overfit in these settings. I would suggest a ridge. > We can see that Y should be a column matrix. Does this mean Scikit-learn > doesn't support multiple outputs? I believe that this is a documentation error. Could you open an issue (only on the documentation error) > I am getting MemoryError when I try to fit now. > More: http://stackoverflow.com/questions/40945791/memoryerror-in-scikit-learn I believe that your problem is too high-dimensional; Too many features. G From gael.varoquaux at normalesup.org Sat Dec 3 05:52:15 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 3 Dec 2016 11:52:15 +0100 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: <20161203105215.GH455403@phare.normalesup.org> On Fri, Dec 02, 2016 at 07:52:09PM -0500, Andy wrote: > So did we ever decide on how to prioritize reviews? I don't know how to do this. > I think it might be helpful if Joel and me prioritize issues. I think that it would be useful. Although of course different people will have different priorities (depending for instance on the type of data that we process). I guess that we can agree on a large part of the prioritization, and hence it will be useful. > Obviously that will only make sense if the other team members check up > on it when deciding what to review / work on. So, the big question is: how do we do this? Isn't there on of the many project-management extension of github that enables this? From ragvrv at gmail.com Sat Dec 3 12:26:29 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sat, 3 Dec 2016 18:26:29 +0100 Subject: [scikit-learn] Github project management tools In-Reply-To: <20161203105215.GH455403@phare.normalesup.org> References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> Message-ID: We could start with assigning priority labels like they use in numpy... That + milestones could help us prioritize? On Sat, Dec 3, 2016 at 11:52 AM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Fri, Dec 02, 2016 at 07:52:09PM -0500, Andy wrote: > > So did we ever decide on how to prioritize reviews? > > I don't know how to do this. > > > I think it might be helpful if Joel and me prioritize issues. > > I think that it would be useful. Although of course different people will > have different priorities (depending for instance on the type of data > that we process). I guess that we can agree on a large part of the > prioritization, and hence it will be useful. > > > Obviously that will only make sense if the other team members check up > > on it when deciding what to review / work on. > > So, the big question is: how do we do this? 
Isn't there on of the many > project-management extension of github that enables this? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From avisochek3 at gmail.com Sat Dec 3 12:19:52 2016 From: avisochek3 at gmail.com (Allan Visochek) Date: Sat, 3 Dec 2016 12:19:52 -0500 Subject: [scikit-learn] Markov Clustering? Message-ID: Hi there, My name is Allan Visochek, I'm a data scientist and web developer and I love scikit-learn so first of all, thanks so much for the work that you do. I'm reaching out because I've found the markov clustering algorithm to be quite useful for me in some of my work and noticed that there is no implementation in scikit-learn, is anybody working on this? If not, id be happy to take this on. I'm new to open source, but I've been working with python for a few years now. Best, -Allan -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Dec 3 13:08:55 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 13:08:55 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> Message-ID: <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> On 12/03/2016 12:26 PM, Raghav R V wrote: > We could start with assigning priority labels like they use in > numpy... That + milestones could help us prioritize? > I feel milestones are too coarse. Or I'm using them wrong. And priority labels only work if people don't use the "high priority" all the time. There is a lot of stuff labeled "bug", which I would interpret as "highest priority" that people don't look at at all. From t3kcit at gmail.com Sat Dec 3 13:10:36 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 13:10:36 -0500 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: <20161203102926.GG455403@phare.normalesup.org> References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: On 12/03/2016 05:29 AM, Gael Varoquaux wrote: > On Sat, Dec 03, 2016 at 03:08:00PM +0530, Alekh Karkada Ashok wrote: >> I want use the Scikit-learn's MLPRegressor to map image to image. That is I >> have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB) pixels). >> Corresponding labelled images I have. Therefore Y is also [1000,230400]. But >> according to documentation: > 1 thousands samples and 2030 thousands features: you are using the wrong > tool, I multi-layer perceptron model will be too complex and overfit in > these settings. I would suggest a ridge. > > These are images! Don't use ridge, use a convolutional neural network. Our MLP is not convolutional, it will not be useful. 
There is a lot of material out there on how to use covolutional neural networks for image labeling (it looks like you have one label per pixel, not per image) From t3kcit at gmail.com Sat Dec 3 13:13:50 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 13:13:50 -0500 Subject: [scikit-learn] Trying to get learning curves with custom scorer and leave one group out In-Reply-To: <8feb0c3aa67fc63a3754d266e053f2e6.squirrel@mycarta.ca> References: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca> <8feb0c3aa67fc63a3754d266e053f2e6.squirrel@mycarta.ca> Message-ID: <264dc532-ed6c-6aed-ac1c-7a6fbad2c2b5@gmail.com> That indeed looks odd. Can you reproduce with synthetic data? On 12/02/2016 10:40 PM, Matteo Niccoli wrote: > My apologies, there was a typo in the code below, second example, should > read: > > train_scores1, test_scores1 = validation_curve(SVC_classifier_LOWO_VC1, X, > y, "C", parm_range1, cv =logo.split(X, y, groups=groups), scoring = > 'accuracy') > > Everything else is correct. > > > On Fri, December 2, 2016 10:28 pm, Matteo Niccoli wrote: >> HI all, >> >> >> I want to plot learning curves on a trained SVM classifier, using a >> custom scorer, and using Leave One Group Out as the method of >> crossvalidation. I thought I had it figured out, but two different scorers >> - 'f1_micro' and >> 'accuracy' - will yield identical values. I am confused, is that supposed >> to be the case? >> >> Here's my code (unfortunately I cannot share the data as it is not open): >> >> >> from sklearn import svm SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, >> class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, >> gamma=0.01, kernel='rbf', max_iter=-1, probability=False, random_state=1, >> shrinking=True, tol=0.001, verbose=False) training_data = >> pd.read_csv('training_data.csv') scaler = >> preprocessing.StandardScaler().fit(X) X = scaler.transform(X) >> y = training_data['Targets'].values groups = training_data["Groups"].values >> Fscorer = make_scorer(f1_score, average = 'micro') >> logo = LeaveOneGroupOut() parm_range0 = np.logspace(-2, 6, 9) > train_scores0, >> test_scores0 = validation_curve(SVC_classifier_LOWO_VC0, X, y, "C", >> parm_range0, cv =logo.split(X, y, groups=groups), scoring = Fscorer) >> >> >> Now, from: >> train_scores_mean0 = np.mean(train_scores0, axis=1) train_scores_std0 = >> np.std(train_scores0, axis=1) test_scores_mean0 = np.mean(test_scores0, >> axis=1) test_scores_std0 = np.std(test_scores0, axis=1) print >> test_scores_mean0 print np.amax(test_scores_mean0) print np.logspace(-2, >> 6, 9)[test_scores_mean0.argmax(axis=0)] >> >> >> I get: >> [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438 >> 0.49426622 0.48066419 0.4868987 ] >> 0.502174200206 >> 100.0 >> >> >> If I create a new classifier, but with the same parameters, and run >> everything exactly as before, except for the scoring, e.g.: >> >> parm_range1 = np.logspace(-2, 6, 9) train_scores1, test_scores1 = >> validation_curve(SVC_classifier_LOWO_VC1, X, y, "C", parm_range1, cv >> =logo.split(X, y, groups=wells), scoring = >> 'accuracy') >> train_scores_mean1 = np.mean(train_scores1, axis=1) train_scores_std1= >> np.std(train_scores1, axis=1) test_scores_mean1 = np.mean(test_scores1, >> axis=1) test_scores_std1 = np.std(test_scores1, axis=1) print >> test_scores_mean1 print np.amax(test_scores_mean1) print np.logspace(-2, >> 6, 9)[test_scores_mean1.argmax(axis=0)] >> >> >> I get exactly the same answer: >> [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 
0.50030438 >> 0.49426622 0.48066419 0.4868987 ] >> 0.502174200206 >> 100.0 >> >> >> How is that possible, am I doing something wrong, or missing something? >> >> >> Thanks >> >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From nelle.varoquaux at gmail.com Sat Dec 3 13:20:14 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Sat, 3 Dec 2016 10:20:14 -0800 Subject: [scikit-learn] Github project management tools In-Reply-To: <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: On 3 December 2016 at 10:08, Andy wrote: > > > On 12/03/2016 12:26 PM, Raghav R V wrote: >> >> We could start with assigning priority labels like they use in numpy... >> That + milestones could help us prioritize? >> > I feel milestones are too coarse. Or I'm using them wrong. > And priority labels only work if people don't use the "high priority" all > the time. > There is a lot of stuff labeled "bug", which I would interpret as "highest > priority" that people don't look at at all. even milestone only work if people don't use the next milestone all the time. I think the only milestone useful is for release critical bugs, for the next release. For example, on matplotlib, I am currently only reviewing and working on tickets for the 2.0 milestone, as we're hoping to get a new candidate release out this week-end. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Sat Dec 3 14:07:33 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 14:07:33 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: On 12/03/2016 01:20 PM, Nelle Varoquaux wrote: > On 3 December 2016 at 10:08, Andy wrote: >> >> On 12/03/2016 12:26 PM, Raghav R V wrote: >>> We could start with assigning priority labels like they use in numpy... >>> That + milestones could help us prioritize? >>> >> I feel milestones are too coarse. Or I'm using them wrong. >> And priority labels only work if people don't use the "high priority" all >> the time. >> There is a lot of stuff labeled "bug", which I would interpret as "highest >> priority" that people don't look at at all. > even milestone only work if people don't use the next milestone all > the time. I think the only milestone useful is for release critical > bugs, for the next release. > For example, on matplotlib, I am currently only reviewing and working > on tickets for the 2.0 milestone, as we're hoping to get a new > candidate release out this week-end. > > That's what I meant by "probably doing it wrong". I assign it to too often. But actually I think people mostly ignore it anyhow ;) From jmschreiber91 at gmail.com Sat Dec 3 14:12:47 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 3 Dec 2016 11:12:47 -0800 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: I don't think anyone is working on this. 
Contributions are always very welcome, but be aware before you start that the process of getting a completely new algorithm into scikit-learn will take a lot of time and reviews. On Sat, Dec 3, 2016 at 9:19 AM, Allan Visochek wrote: > Hi there, > > My name is Allan Visochek, I'm a data scientist and web developer and I > love scikit-learn so first of all, thanks so much for the work that you do. > > I'm reaching out because I've found the markov clustering algorithm to be > quite useful for me in some of my work and noticed that there is no > implementation in scikit-learn, is anybody working on this? If not, id be > happy to take this on. I'm new to open source, but I've been working with > python for a few years now. > > Best, > -Allan > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alekhka at gmail.com Sat Dec 3 15:10:55 2016 From: alekhka at gmail.com (Alekh Karkada Ashok) Date: Sun, 4 Dec 2016 01:40:55 +0530 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: Hey All, I chose MLP because they were images and I have heard MLPs perform better. My application is detecting body parts from these images and therefore, the mapping would be pretty non-linear and this was my idea behind selecting MLP. Otherwise, I would have to engineer high dimension features by hand. I have 2030400 pixels and making higher dimensional features would require a lot more memory. Where do you want to me to open the issue? GitHub? I don't think the error is only in documentation. Because when Y is [2030400,1] there is no MemoryError (treated as 2030400 samples with a single feature) and when I try to fit [1,2030400] it throws MemoryError. If the case was memory, both should have thrown the error right? I am still a novice but I am fairly good with Python. I am taken aback by scikit's sheer beauty and simplicity. I would love to contribute code to it. Can you please tell me how I can get started? Thanks a lot! On Sat, Dec 3, 2016 at 11:40 PM, Andy wrote: > > > On 12/03/2016 05:29 AM, Gael Varoquaux wrote: > >> On Sat, Dec 03, 2016 at 03:08:00PM +0530, Alekh Karkada Ashok wrote: >> >>> I want use the Scikit-learn's MLPRegressor to map image to image. That >>> is I >>> have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB) >>> pixels). >>> Corresponding labelled images I have. Therefore Y is also [1000,230400]. >>> But >>> according to documentation: >>> >> 1 thousands samples and 2030 thousands features: you are using the wrong >> tool, I multi-layer perceptron model will be too complex and overfit in >> these settings. I would suggest a ridge. >> >> >> These are images! Don't use ridge, use a convolutional neural network. > Our MLP is not convolutional, it will not be useful. > There is a lot of material out there on how to use covolutional neural > networks > for image labeling (it looks like you have one label per pixel, not per > image) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Sat Dec 3 15:34:55 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 15:34:55 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: Hi Allan. Can you provide the original paper? It this something usually used on sparse graphs? We do have algorithms that operate on data-induced graphs, like SpectralClustering, but we don't really implement general graph algorithms (there's no PageRank or community detection). Andy On 12/03/2016 12:19 PM, Allan Visochek wrote: > Hi there, > > My name is Allan Visochek, I'm a data scientist and web developer and > I love scikit-learn so first of all, thanks so much for the work that > you do. > > I'm reaching out because I've found the markov clustering algorithm to > be quite useful for me in some of my work and noticed that there is no > implementation in scikit-learn, is anybody working on this? If not, id > be happy to take this on. I'm new to open source, but I've been > working with python for a few years now. > > Best, > -Allan > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Dec 3 15:41:45 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 15:41:45 -0500 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: On 12/03/2016 03:10 PM, Alekh Karkada Ashok wrote: > > Hey All, > > I chose MLP because they were images and I have heard MLPs perform better. Better than a convolutional neural net? Whoever told you that was wrong. I usually don't make absolute statements like this, but this is something that is pretty certain. > > Where do you want to me to open the issue? GitHub? I don't think the > error is only in documentation. Because when Y is [2030400,1] there is > no MemoryError (treated as 2030400 samples with a single feature) and > when I try to fit [1,2030400] it throws MemoryError. If the case was > memory, both should have thrown the error right? MLPClassifier actually supports multi-label classification (which is not documented correctly and I made an issue here: https://github.com/scikit-learn/scikit-learn/issues/7972) MLPClassifier does not support multi-output (multi-class multi-output), which is probably what you want. From avn at mccme.ru Sat Dec 3 15:39:04 2016 From: avn at mccme.ru (avn at mccme.ru) Date: Sat, 03 Dec 2016 23:39:04 +0300 Subject: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels Message-ID: <2379df1fffaf791977177019377b57bc@mccme.ru> Hello, In the course of my work, I've made samplers for intersection/Jensen-Shannon kernels, just by small modifications to sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection kernel proved to be the best one for my task (clustering Docstrum feature vectors), so perhaps it'd be good to add those samplers alongside AdditiveChi2Sampler? Should I proceed with creating a pull request? Or, perhaps, those kernels were not already included for some good reason? 
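For anyone who wants to experiment with the intersection (histogram) kernel before such a sampler exists, here is a minimal sketch that plugs an exact histogram-intersection kernel into the existing Nystroem approximator; the intersection_kernel helper and the toy data are illustrative assumptions, not code from this thread or an existing scikit-learn kernel:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def intersection_kernel(x, y):
    # k(x, y) = sum_i min(x_i, y_i), evaluated on one pair of samples;
    # assumes non-negative, histogram-like feature vectors.
    return np.minimum(x, y).sum()

# Toy non-negative data standing in for histogram features.
X, y = make_blobs(n_samples=200, centers=3, random_state=0)
X = np.abs(X)

# Nystroem accepts a callable kernel, so it can act as an approximate
# feature map for the intersection kernel inside a standard pipeline.
clf = make_pipeline(
    Nystroem(kernel=intersection_kernel, n_components=50, random_state=0),
    LinearSVC())
clf.fit(X, y)
print(clf.score(X, y))

For the exact (unapproximated) kernel one could instead precompute the Gram matrix and use SVC(kernel='precomputed'), at the usual quadratic cost.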
With best regards, -- Valery From t3kcit at gmail.com Sat Dec 3 16:23:21 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 16:23:21 -0500 Subject: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels In-Reply-To: <2379df1fffaf791977177019377b57bc@mccme.ru> References: <2379df1fffaf791977177019377b57bc@mccme.ru> Message-ID: Hi Valery. I didn't include them because the Chi2 worked better for my task ;) In hindsight, I'm not sure if these kernels are not to a bit too specialized for scikit-learn. But given that we have the (slightly more obscure) SkewedChi2 and AdditiveChi2, I think the intersection one would be a good addition if you found it useful. Andy On 12/03/2016 03:39 PM, Valery Anisimovsky via scikit-learn wrote: > Hello, > > In the course of my work, I've made samplers for > intersection/Jensen-Shannon kernels, just by small modifications to > sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection > kernel proved to be the best one for my task (clustering Docstrum > feature vectors), so perhaps it'd be good to add those samplers > alongside AdditiveChi2Sampler? Should I proceed with creating a pull > request? Or, perhaps, those kernels were not already included for some > good reason? > > With best regards, > -- Valery > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From avisochek3 at gmail.com Sat Dec 3 16:33:58 2016 From: avisochek3 at gmail.com (Allan Visochek) Date: Sat, 3 Dec 2016 16:33:58 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: Hey Andy, This algorithm does operate on sparse graphs so it may be beyond the scope of sci-kit learn, let me know what you think. The website is here , it includes a brief description of how the algorithm operates under Documentation -> Overview1 and Overview2. The references listed on the website are included below. Best, -Allan [1] Stijn van Dongen. *Graph Clustering by Flow Simulation*. PhD thesis, University of Utrecht, May 2000. http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm [2] Stijn van Dongen. *A cluster algorithm for graphs*. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z [3] Stijn van Dongen. *A stochastic uncoupling process for graphs*. Technical Report INS-R0011, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z [4] Stijn van Dongen. *Performance criteria for graph clustering and Markov cluster experiments*. Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z [5] Enright A.J., Van Dongen S., Ouzounis C.A. *An efficient algorithm for large-scale detection of protein families*, Nucleic Acids Research 30(7):1575-1584 (2002). On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have algorithms > that operate on data-induced > graphs, like SpectralClustering, but we don't really implement general > graph algorithms (there's no PageRank or community detection). 
> > Andy > > > On 12/03/2016 12:19 PM, Allan Visochek wrote: > > Hi there, > > My name is Allan Visochek, I'm a data scientist and web developer and I > love scikit-learn so first of all, thanks so much for the work that you do. > > I'm reaching out because I've found the markov clustering algorithm to be > quite useful for me in some of my work and noticed that there is no > implementation in scikit-learn, is anybody working on this? If not, id be > happy to take this on. I'm new to open source, but I've been working with > python for a few years now. > > Best, > -Allan > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Dec 3 16:45:02 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 16:45:02 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: Hey Allan. None of the references apart from the last one seems to be published in a peer-reviewed place, is that right? And "A stochastic uncoupling process for graphs" has 13 citations since 2000. Unless there is a more prominent publication or evidence of heavy use, I think it's disqualified. Academia is certainly not the only metric for evaluation, so if you have others, that's good, too ;) Best, Andy On 12/03/2016 04:33 PM, Allan Visochek wrote: > Hey Andy, > > This algorithm does operate on sparse graphs so it may be beyond the > scope of sci-kit learn, let me know what you think. > The website is here , it includes a brief > description of how the algorithm operates under Documentation -> > Overview1 and Overview2. > The references listed on the website are included below. > > Best, > -Allan > > [1] Stijn van Dongen. /Graph Clustering by Flow Simulation/. PhD > thesis, University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > > [2] Stijn van Dongen. /A cluster algorithm for graphs/. Technical > Report INS-R0010, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > > > [3] Stijn van Dongen. /A stochastic uncoupling process for graphs/. > Technical Report INS-R0011, National Research Institute for > Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > > > [4] Stijn van Dongen. /Performance criteria for graph clustering and > Markov cluster experiments/. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the > Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > > > [5] Enright A.J., Van Dongen S., Ouzounis C.A. /An efficient algorithm > for large-scale detection of protein families/, Nucleic Acids Research > 30(7):1575-1584 (2002). > > > On Sat, Dec 3, 2016 at 3:34 PM, Andy > wrote: > > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have > algorithms that operate on data-induced > graphs, like SpectralClustering, but we don't really implement > general graph algorithms (there's no PageRank or community detection). 
> > Andy > > > On 12/03/2016 12:19 PM, Allan Visochek wrote: >> Hi there, >> >> My name is Allan Visochek, I'm a data scientist and web developer >> and I love scikit-learn so first of all, thanks so much for the >> work that you do. >> >> I'm reaching out because I've found the markov clustering >> algorithm to be quite useful for me in some of my work and >> noticed that there is no implementation in scikit-learn, is >> anybody working on this? If not, id be happy to take this on. I'm >> new to open source, but I've been working with python for a few >> years now. >> >> Best, >> -Allan >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From alekhka at gmail.com Sat Dec 3 17:00:04 2016 From: alekhka at gmail.com (Alekh Karkada Ashok) Date: Sun, 4 Dec 2016 03:30:04 +0530 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: No, I am not saying it is better than CNN, but my images aren't real-life images but computer generated silhouettes. So CNN seemed to be overkill. I'll revisit CNN. I resized the images and converted it to grayscale. Now I am feeding [1,4800] now and I am getting good output with MLP. I looped over all my images and used partial_fit to train each one. I didn't get what you meant by MLPClassifier doesn't support multi-output. Thanks for the help! On Sun, Dec 4, 2016 at 2:11 AM, Andy wrote: > > > On 12/03/2016 03:10 PM, Alekh Karkada Ashok wrote: > >> >> Hey All, >> >> I chose MLP because they were images and I have heard MLPs perform better. >> > Better than a convolutional neural net? Whoever told you that was wrong. I > usually don't make absolute statements like this, but this is something > that is pretty certain. > > >> Where do you want to me to open the issue? GitHub? I don't think the >> error is only in documentation. Because when Y is [2030400,1] there is no >> MemoryError (treated as 2030400 samples with a single feature) and when I >> try to fit [1,2030400] it throws MemoryError. If the case was memory, both >> should have thrown the error right? >> > MLPClassifier actually supports multi-label classification (which is not > documented correctly and I made an issue here: > https://github.com/scikit-learn/scikit-learn/issues/7972) > MLPClassifier does not support multi-output (multi-class multi-output), > which is probably what you want. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vaggi.federico at gmail.com Sat Dec 3 17:35:03 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Sat, 03 Dec 2016 22:35:03 +0000 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: As long as the feature ordering has a meaningful spatial component (as is almost always the case when you are dealing with raw pixels as features) CNNs will almost always be better. CNNs actually have a lot fewer parameters than MLPs (depending on architecture of course) because of weight sharing among the parameters of the convolutional kernel within a feature map. On Sat, 3 Dec 2016 at 23:00 Alekh Karkada Ashok wrote: > No, I am not saying it is better than CNN, but my images aren't real-life > images but computer generated silhouettes. So CNN seemed to be overkill. > I'll revisit CNN. I resized the images and converted it to grayscale. Now I > am feeding [1,4800] now and I am getting good output with MLP. I looped > over all my images and used partial_fit to train each one. > I didn't get what you meant by MLPClassifier doesn't support multi-output. > Thanks for the help! > > On Sun, Dec 4, 2016 at 2:11 AM, Andy wrote: > > > > On 12/03/2016 03:10 PM, Alekh Karkada Ashok wrote: > > > Hey All, > > I chose MLP because they were images and I have heard MLPs perform better. > > Better than a convolutional neural net? Whoever told you that was wrong. I > usually don't make absolute statements like this, but this is something > that is pretty certain. > > > Where do you want to me to open the issue? GitHub? I don't think the error > is only in documentation. Because when Y is [2030400,1] there is no > MemoryError (treated as 2030400 samples with a single feature) and when I > try to fit [1,2030400] it throws MemoryError. If the case was memory, both > should have thrown the error right? > > MLPClassifier actually supports multi-label classification (which is not > documented correctly and I made an issue here: > https://github.com/scikit-learn/scikit-learn/issues/7972) > MLPClassifier does not support multi-output (multi-class multi-output), > which is probably what you want. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From avisochek3 at gmail.com Sat Dec 3 17:43:30 2016 From: avisochek3 at gmail.com (Allan Visochek) Date: Sat, 3 Dec 2016 17:43:30 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: Thanks for pointing that out, I sort of picked it up by word of mouth so I'd assumed it had a bit more precedence in the academic world. I'll look into it a little more, but I'd definitely be interested in contributing something else if that doesn't work out. -Allan On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > Hey Allan. > > None of the references apart from the last one seems to be published in a > peer-reviewed place, is that right? > And "A stochastic uncoupling process for graphs" has 13 citations since > 2000. Unless there is a more prominent > publication or evidence of heavy use, I think it's disqualified. 
> Academia is certainly not the only metric for evaluation, so if you have > others, that's good, too ;) > > Best, > Andy > > On 12/03/2016 04:33 PM, Allan Visochek wrote: > > Hey Andy, > > This algorithm does operate on sparse graphs so it may be beyond the scope > of sci-kit learn, let me know what you think. > The website is here , it includes a brief > description of how the algorithm operates under Documentation -> Overview1 > and Overview2. > The references listed on the website are included below. > > Best, > -Allan > > [1] Stijn van Dongen. *Graph Clustering by Flow Simulation*. PhD thesis, > University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > [2] Stijn van Dongen. *A cluster algorithm for graphs*. Technical Report > INS-R0010, National Research Institute for Mathematics and Computer Science > in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > > [3] Stijn van Dongen. *A stochastic uncoupling process for graphs*. > Technical Report INS-R0011, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > > [4] Stijn van Dongen. *Performance criteria for graph clustering and > Markov cluster experiments*. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the Netherlands, > Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > > [5] Enright A.J., Van Dongen S., Ouzounis C.A. *An efficient algorithm > for large-scale detection of protein families*, Nucleic Acids Research > 30(7):1575-1584 (2002). > > On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > >> Hi Allan. >> Can you provide the original paper? >> It this something usually used on sparse graphs? We do have algorithms >> that operate on data-induced >> graphs, like SpectralClustering, but we don't really implement general >> graph algorithms (there's no PageRank or community detection). >> >> Andy >> >> >> On 12/03/2016 12:19 PM, Allan Visochek wrote: >> >> Hi there, >> >> My name is Allan Visochek, I'm a data scientist and web developer and I >> love scikit-learn so first of all, thanks so much for the work that you do. >> >> I'm reaching out because I've found the markov clustering algorithm to be >> quite useful for me in some of my work and noticed that there is no >> implementation in scikit-learn, is anybody working on this? If not, id be >> happy to take this on. I'm new to open source, but I've been working with >> python for a few years now. >> >> Best, >> -Allan >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ scikit-learn mailing >> list scikit-learn at python.org https://mail.python.org/mailma >> n/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Dec 4 03:18:54 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 04 Dec 2016 08:18:54 +0000 Subject: [scikit-learn] Markov Clustering? 
In-Reply-To: References: Message-ID: I think you get a better view of the importance of Markov Clustering in academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q=Markov+clustering . Raphael On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote: > Thanks for pointing that out, I sort of picked it up by word of mouth so > I'd assumed it had a bit more precedence in the academic world. > > I'll look into it a little more, but I'd definitely be interested in > contributing something else if that doesn't work out. > > -Allan > > On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > > Hey Allan. > > None of the references apart from the last one seems to be published in a > peer-reviewed place, is that right? > And "A stochastic uncoupling process for graphs" has 13 citations since > 2000. Unless there is a more prominent > publication or evidence of heavy use, I think it's disqualified. > Academia is certainly not the only metric for evaluation, so if you have > others, that's good, too ;) > > Best, > Andy > > On 12/03/2016 04:33 PM, Allan Visochek wrote: > > Hey Andy, > > This algorithm does operate on sparse graphs so it may be beyond the scope > of sci-kit learn, let me know what you think. > The website is here , it includes a brief > description of how the algorithm operates under Documentation -> Overview1 > and Overview2. > The references listed on the website are included below. > > Best, > -Allan > > [1] Stijn van Dongen. *Graph Clustering by Flow Simulation*. PhD thesis, > University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > [2] Stijn van Dongen. *A cluster algorithm for graphs*. Technical Report > INS-R0010, National Research Institute for Mathematics and Computer Science > in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > > [3] Stijn van Dongen. *A stochastic uncoupling process for graphs*. > Technical Report INS-R0011, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > > [4] Stijn van Dongen. *Performance criteria for graph clustering and > Markov cluster experiments*. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the Netherlands, > Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > > [5] Enright A.J., Van Dongen S., Ouzounis C.A. *An efficient algorithm > for large-scale detection of protein families*, Nucleic Acids Research > 30(7):1575-1584 (2002). > > On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have algorithms > that operate on data-induced > graphs, like SpectralClustering, but we don't really implement general > graph algorithms (there's no PageRank or community detection). > > Andy > > > On 12/03/2016 12:19 PM, Allan Visochek wrote: > > Hi there, > > My name is Allan Visochek, I'm a data scientist and web developer and I > love scikit-learn so first of all, thanks so much for the work that you do. > > I'm reaching out because I've found the markov clustering algorithm to be > quite useful for me in some of my work and noticed that there is no > implementation in scikit-learn, is anybody working on this? If not, id be > happy to take this on. I'm new to open source, but I've been working with > python for a few years now. 
> > Best, > -Allan > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Sun Dec 4 11:16:32 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sun, 4 Dec 2016 17:16:32 +0100 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: > > Okay so in the project, instead of sorting them by Issues / PR why don't >>> we make one column per priority. Let's have 3 levels and one column for >>> Done. We have a label for "Stalled" / "Need Contributor" which shows up in >>> the cards of the project anyway... >>> >>> As I didn't want to disturb the existing project setup, I created one >>> for a demo - https://github.com/scikit-learn/scikit-learn/projects/7 >>> (I'm resending this e-mail as the last one was rejected because the >>> attached image was huge for the mailing list) >>> >> Thanks Raghav -------------- next part -------------- An HTML attachment was scrubbed... URL: From ludo25_90 at hotmail.com Sun Dec 4 15:12:29 2016 From: ludo25_90 at hotmail.com (Ludovico Coletta) Date: Sun, 4 Dec 2016 20:12:29 +0000 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Dear scikit experts, I'm struggling with the implementation of a nested cross validation. My data: I have 26 subjects (13 per class) x 6670 features. I used a feature reduction algorithm (you may have heard about Boruta) to reduce the dimensionality of my data. Problems start now: I defined LOSO as outer partitioning schema. Therefore, for each of the 26 cv folds I used 24 subjects for feature reduction. This lead to a different number of features in each cv fold. Now, for each cv fold I would like to use the same 24 subjects for hyperparameter optimization (SVM with rbf kernel). 
This is what I did:

cv = list(LeaveOneOut(len(y))) # in y I stored the labels
inner_train = [None] * len(y)
inner_test = [None] * len(y)
ii = 0
while ii < len(y):
    cv = list(LeaveOneOut(len(y)))
    a = cv[ii][0]
    a = a[:-1]
    inner_train[ii] = a
    b = cv[ii][0]
    b = np.array(b[((len(cv[0][0]))-1)])
    inner_test[ii] = b
    ii = ii + 1

custom_cv = zip(inner_train, inner_test) # inner cv

pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel="rbf"))])
parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma': np.logspace(-9, 3, 13)}]

scores = [None] * (len(y))
ii = 0
while ii < len(scores):
    a = data[ii][0] # data for train
    b = data[ii][1] # data for test
    c = np.concatenate((a,b)) # shape: number of subjects * number of features
    d = cv[ii][0] # labels for train
    e = cv[ii][1] # label for test
    f = np.concatenate((d,e))
    grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv=zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))
    scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv=zip(([cv[ii][0]]), ([cv[ii][1]])))
    ii = ii + 1

However, I got the following error message: index 25 is out of bounds for size 25

Would it be so bad if I do not perform a nested LOSO but I use the default setting for hyperparameter optimization?

Any help would be really appreciated
-------------- next part -------------- An HTML attachment was scrubbed... URL: From avn at mccme.ru Sun Dec 4 15:50:21 2016 From: avn at mccme.ru (avn at mccme.ru) Date: Sun, 04 Dec 2016 23:50:21 +0300 Subject: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels In-Reply-To: References: <2379df1fffaf791977177019377b57bc@mccme.ru> Message-ID: <0511d5fa33737f78ccdf7fbb2e5b2156@mccme.ru> I see now. So I'll proceed with adding documentation and unit tests for those kernels to complete their support. And I don't think they're too specialized, given that many kinds of feature vectors in e.g. computer vision are in fact histograms and all of those kernels are histogram-oriented. Andy wrote on 2016-12-04 00:23: > Hi Valery. > I didn't include them because the Chi2 worked better for my task ;) > In hindsight, I'm not sure if these kernels are not to a bit too > specialized for scikit-learn. > But given that we have the (slightly more obscure) SkewedChi2 and > AdditiveChi2, > I think the intersection one would be a good addition if you found it > useful. > > Andy > > On 12/03/2016 03:39 PM, Valery Anisimovsky via scikit-learn wrote: >> Hello, >> >> In the course of my work, I've made samplers for >> intersection/Jensen-Shannon kernels, just by small modifications to >> sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection >> kernel proved to be the best one for my task (clustering Docstrum >> feature vectors), so perhaps it'd be good to add those samplers >> alongside AdditiveChi2Sampler? Should I proceed with creating a pull >> request? Or, perhaps, those kernels were not already included for some >> good reason?
>> >> With best regards, >> -- Valery >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ragvrv at gmail.com Sun Dec 4 16:27:02 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sun, 4 Dec 2016 22:27:02 +0100 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: Hi! It looks like you are using the old `sklearn.cross_validation`'s LeaveOneLabelOut cross-validator. It has been deprecated since v0.18. Use the `LeaveOneLabelOut` from `sklearn.model_selection`, that should fix your issue I think (thought I have not looked into your code in detail). HTH! On Sun, Dec 4, 2016 at 9:12 PM, Ludovico Coletta wrote: > Dear scikit experts, > > I'm struggling with the implementation of a nested cross validation. > > My data: I have 26 subjects (13 per class) x 6670 features. I used a > feature reduction algorithm (you may have heard about Boruta) to reduce the > dimensionality of my data. Problems start now: I defined LOSO as outer > partitioning schema. Therefore, for each of the 26 cv folds I used 24 > subjects for feature reduction. This lead to a different number of features > in each cv fold. Now, for each cv fold I would like to use the same 24 > subjects for hyperparameter optimization (SVM with rbf kernel). > > This is what I did: > > *cv = list(LeaveOneout(len(y))) # in y I stored the labels* > > *inner_train = [None] * len(y)* > > *inner_test = [None] * len(y)* > > *ii = 0* > > *while ii < len(y):* > * cv = list(LeaveOneOut(len(y))) * > * a = cv[ii][0]* > * a = a[:-1]* > * inner_train[ii] = a* > > * b = cv[ii][0]* > * b = np.array(b[((len(cv[0][0]))-1)])* > * inner_test[ii]=b* > > * ii = ii + 1* > > *custom_cv = zip(inner_train,inner_test) # inner cv* > > > *pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', > SVC(kernel="rbf"))])* > > *parameters = [{'clf__C': np.logspace(-2, 10, 13), > 'clf__gamma':np.logspace(-9, 3, 13)}]* > > > > *scores = [None] * (len(y)) * > > *ii = 0* > > *while ii < len(scores):* > > * a = data[ii][0] # data for train* > * b = data[ii][1] # data for test* > * c = np.concatenate((a,b)) # shape: number of subjects * number of > features* > * d = cv[ii][0] # labels for train* > * e = cv[ii][1] # label for test* > * f = np.concatenate((d,e))* > > * grid_search = GridSearchCV(estimator=pipe_logistic, > param_grid=parameters, verbose=1, scoring='accuracy', cv= > zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))* > > * scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], > scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))* > > * ii = ii + 1* > > > > However, I got the following error message: index 25 is out of bounds for > size 25 > > Would it be so bad if I do not perform a nested LOSO but I use the default > setting for hyperparameter optimization? > > Any help would be really appreciated > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ludo25_90 at hotmail.com Mon Dec 5 08:39:40 2016 From: ludo25_90 at hotmail.com (Ludovico Coletta) Date: Mon, 5 Dec 2016 13:39:40 +0000 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: Unfortunately, it did not work. I think I am doing something wrong when passing the nested cv, but I do not understand where. If I omit the cv argument in the grid search it runs smoothly. I would like to have LeaveOneOut in both the outer and inner cv, how would you implement such a thing? Best Ludovico ________________________________ Da: scikit-learn per conto di scikit-learn-request at python.org Inviato: domenica 4 dicembre 2016 22.27 A: scikit-learn at python.org Oggetto: scikit-learn Digest, Vol 9, Issue 13 Send scikit-learn mailing list submissions to scikit-learn at python.org To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... or, via email, send a message with subject or body 'help' to scikit-learn-request at python.org You can reach the person managing the list at scikit-learn-owner at python.org When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..." Today's Topics: 1. Nested Leave One Subject Out (LOSO) cross validation with scikit (Ludovico Coletta) 2. Re: Adding samplers for intersection/Jensen-Shannon kernels (avn at mccme.ru) 3. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Raghav R V) ---------------------------------------------------------------------- Message: 1 Date: Sun, 4 Dec 2016 20:12:29 +0000 From: Ludovico Coletta To: "scikit-learn at python.org" Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Content-Type: text/plain; charset="iso-8859-1" Dear scikit experts, I'm struggling with the implementation of a nested cross validation. My data: I have 26 subjects (13 per class) x 6670 features. I used a feature reduction algorithm (you may have heard about Boruta) to reduce the dimensionality of my data. Problems start now: I defined LOSO as outer partitioning schema. Therefore, for each of the 26 cv folds I used 24 subjects for feature reduction. This lead to a different number of features in each cv fold. Now, for each cv fold I would like to use the same 24 subjects for hyperparameter optimization (SVM with rbf kernel). 
This is what I did: cv = list(LeaveOneout(len(y))) # in y I stored the labels inner_train = [None] * len(y) inner_test = [None] * len(y) ii = 0 while ii < len(y): cv = list(LeaveOneOut(len(y))) a = cv[ii][0] a = a[:-1] inner_train[ii] = a b = cv[ii][0] b = np.array(b[((len(cv[0][0]))-1)]) inner_test[ii]=b ii = ii + 1 custom_cv = zip(inner_train,inner_test) # inner cv pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', SVC(kernel="rbf"))]) parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma':np.logspace(-9, 3, 13)}] scores = [None] * (len(y)) ii = 0 while ii < len(scores): a = data[ii][0] # data for train b = data[ii][1] # data for test c = np.concatenate((a,b)) # shape: number of subjects * number of features d = cv[ii][0] # labels for train e = cv[ii][1] # label for test f = np.concatenate((d,e)) grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv= zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]]))) scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]]))) ii = ii + 1 However, I got the following error message: index 25 is out of bounds for size 25 Would it be so bad if I do not perform a nested LOSO but I use the default setting for hyperparameter optimization? Any help would be really appreciated -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Sun, 04 Dec 2016 23:50:21 +0300 From: avn at mccme.ru To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels Message-ID: <0511d5fa33737f78ccdf7fbb2e5b2156 at mccme.ru> Content-Type: text/plain; charset=UTF-8; format=flowed I see now. So I'll proceed with adding documentation and unit tests for those kernels to complete their support. And I don't think they're too specialized, given that many kinds of feature vectors in e.g. computer vision are in fact histograms and all of those kernels are histogram-oriented. Andy ????? 2016-12-04 00:23: > Hi Valery. > I didn't include them because the Chi2 worked better for my task ;) > In hindsight, I'm not sure if these kernels are not to a bit too > specialized for scikit-learn. > But given that we have the (slightly more obscure) SkewedChi2 and > AdditiveChi2, > I think the intersection one would be a good addition if you found it > useful. > > Andy > > On 12/03/2016 03:39 PM, Valery Anisimovsky via scikit-learn wrote: >> Hello, >> >> In the course of my work, I've made samplers for >> intersection/Jensen-Shannon kernels, just by small modifications to >> sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection >> kernel proved to be the best one for my task (clustering Docstrum >> feature vectors), so perhaps it'd be good to add those samplers >> alongside AdditiveChi2Sampler? Should I proceed with creating a pull >> request? Or, perhaps, those kernels were not already included for some >> good reason? >> >> With best regards, >> -- Valery >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... 
> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... ------------------------------ Message: 3 Date: Sun, 4 Dec 2016 22:27:02 +0100 From: Raghav R V To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Content-Type: text/plain; charset="utf-8" Hi! It looks like you are using the old `sklearn.cross_validation`'s LeaveOneLabelOut cross-validator. It has been deprecated since v0.18. Use the `LeaveOneLabelOut` from `sklearn.model_selection`, that should fix your issue I think (thought I have not looked into your code in detail). HTH! On Sun, Dec 4, 2016 at 9:12 PM, Ludovico Coletta wrote: > Dear scikit experts, > > I'm struggling with the implementation of a nested cross validation. > > My data: I have 26 subjects (13 per class) x 6670 features. I used a > feature reduction algorithm (you may have heard about Boruta) to reduce the > dimensionality of my data. Problems start now: I defined LOSO as outer > partitioning schema. Therefore, for each of the 26 cv folds I used 24 > subjects for feature reduction. This lead to a different number of features > in each cv fold. Now, for each cv fold I would like to use the same 24 > subjects for hyperparameter optimization (SVM with rbf kernel). > > This is what I did: > > *cv = list(LeaveOneout(len(y))) # in y I stored the labels* > > *inner_train = [None] * len(y)* > > *inner_test = [None] * len(y)* > > *ii = 0* > > *while ii < len(y):* > * cv = list(LeaveOneOut(len(y))) * > * a = cv[ii][0]* > * a = a[:-1]* > * inner_train[ii] = a* > > * b = cv[ii][0]* > * b = np.array(b[((len(cv[0][0]))-1)])* > * inner_test[ii]=b* > > * ii = ii + 1* > > *custom_cv = zip(inner_train,inner_test) # inner cv* > > > *pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', > SVC(kernel="rbf"))])* > > *parameters = [{'clf__C': np.logspace(-2, 10, 13), > 'clf__gamma':np.logspace(-9, 3, 13)}]* > > > > *scores = [None] * (len(y)) * > > *ii = 0* > > *while ii < len(scores):* > > * a = data[ii][0] # data for train* > * b = data[ii][1] # data for test* > * c = np.concatenate((a,b)) # shape: number of subjects * number of > features* > * d = cv[ii][0] # labels for train* > * e = cv[ii][1] # label for test* > * f = np.concatenate((d,e))* > > * grid_search = GridSearchCV(estimator=pipe_logistic, > param_grid=parameters, verbose=1, scoring='accuracy', cv= > zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))* > > * scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], > scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))* > > * ii = ii + 1* > > > > However, I got the following error message: index 25 is out of bounds for > size 25 > > Would it be so bad if I do not perform a nested LOSO but I use the default > setting for hyperparameter optimization? > > Any help would be really appreciated > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... 
> > -- Raghav RV https://github.com/raghavrv [https://avatars2.githubusercontent.com/u/9487348?v=3&s=400] raghavrv (Raghav RV) ? GitHub github.com raghavrv has 18 repositories available. Follow their code on GitHub. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Subject: Digest Footer _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... ------------------------------ End of scikit-learn Digest, Vol 9, Issue 13 ******************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Dec 5 08:45:52 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 5 Dec 2016 14:45:52 +0100 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: <20161205134552.GK2327874@phare.normalesup.org> Interestingly, a couple of days before this thread was started a researcher in a top lab of a huge private-sector company had mentionned to me that they found this algorithm very useful in practice (sorry for taking time to point this out, I just needed to check with him that indeed it was this specific algorithm). G On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote: > I think you get a better view of the importance of Markov Clustering in > academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q= > Markov+clustering . > Raphael > On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote: > Thanks for pointing that out, I sort of picked it up by word of mouth so > I'd assumed it had a bit more precedence in the academic world. ? > I'll look into it a little more, but I'd definitely be interested in > contributing something else if that doesn't work out. > -Allan > On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > Hey Allan. > None of the references apart from the last one seems to be published in > a peer-reviewed place, is that right? > And "A stochastic uncoupling process for graphs" has 13 citations since > 2000. Unless there is a more prominent > publication or evidence of heavy use, I think it's disqualified. > Academia is certainly not the only metric for evaluation, so if you > have others, that's good, too ;) > Best, > Andy > On 12/03/2016 04:33 PM, Allan Visochek wrote: > Hey Andy, > This algorithm does operate on sparse graphs so it may be beyond > the scope of sci-kit learn, let me know what you think.? > The website is here, it includes a brief description of how the > algorithm operates under Documentation -> Overview1 and Overview2.? > The references listed on the website are included below. > Best, > -Allan > [1]?Stijn van Dongen.?Graph Clustering by Flow Simulation. PhD > thesis, University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > [2]?Stijn van Dongen.?A cluster algorithm for graphs. Technical > Report INS-R0010, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > [3]?Stijn van Dongen.?A stochastic uncoupling process for graphs. 
> Technical Report INS-R0011, National Research Institute for > Mathematics and Computer Science in the Netherlands, Amsterdam, May > 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > [4]?Stijn van Dongen.?Performance criteria for graph clustering and > Markov cluster experiments. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the > Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > [5]?Enright A.J., Van Dongen S., Ouzounis C.A.?An efficient > algorithm for large-scale detection of protein families, Nucleic > Acids Research 30(7):1575-1584 (2002). > On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have > algorithms that operate on data-induced > graphs, like SpectralClustering, but we don't really implement > general graph algorithms (there's no PageRank or community > detection). > Andy > On 12/03/2016 12:19 PM, Allan Visochek wrote: > Hi there, > My name is Allan Visochek, I'm a data scientist and web > developer and I love scikit-learn so first of all, thanks > so much for the work that you do.? > I'm reaching out because I've found the markov clustering > algorithm to be quite useful for me in some of my work and > noticed that there is no implementation in scikit-learn, is > anybody working on this? If not, id be happy to take this > on. I'm new to open source, but I've been working with > python for a few years now.? > Best, > -Allan > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org https://mail.python.org/ > mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From drraph at gmail.com Mon Dec 5 08:51:26 2016 From: drraph at gmail.com (Raphael C) Date: Mon, 5 Dec 2016 13:51:26 +0000 Subject: [scikit-learn] Markov Clustering? In-Reply-To: <20161205134552.GK2327874@phare.normalesup.org> References: <20161205134552.GK2327874@phare.normalesup.org> Message-ID: And... [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm has 1201 citations. I think it's fair to say the method is very widely known and used. 
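For readers who have not seen it, the core of the method in [1] is only a few lines: alternate expansion (matrix powering) and inflation (elementwise powering) of a column-stochastic matrix until it settles into attractors. A toy NumPy sketch, with illustrative parameter choices and no claim to match the reference implementation:

import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, n_iter=100):
    # Markov Cluster iteration on a dense adjacency matrix (toy version).
    M = adjacency + np.eye(len(adjacency))        # add self-loops
    M = M / M.sum(axis=0)                         # column-normalize
    for _ in range(n_iter):
        M = np.linalg.matrix_power(M, expansion)  # expansion
        M = M ** inflation                        # inflation
        M = M / M.sum(axis=0)                     # re-normalize columns
    # Clusters are read off the attractor rows (rows that kept any mass).
    return list({frozenset(np.nonzero(row > 1e-9)[0])
                 for row in M if row.sum() > 1e-9})

# Two triangles joined by a single edge -> two clusters expected.
A = np.array([[0., 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(mcl(A))

The natural input is an affinity or k-nearest-neighbour graph, which is also the kind of matrix SpectralClustering consumes with affinity='precomputed'.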
Raphael On 5 December 2016 at 13:45, Gael Varoquaux wrote: > Interestingly, a couple of days before this thread was started a > researcher in a top lab of a huge private-sector company had mentionned > to me that they found this algorithm very useful in practice (sorry for > taking time to point this out, I just needed to check with him that > indeed it was this specific algorithm). > > G > > On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote: >> I think you get a better view of the importance of Markov Clustering in >> academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q= >> Markov+clustering . > >> Raphael > >> On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote: > >> Thanks for pointing that out, I sort of picked it up by word of mouth so >> I'd assumed it had a bit more precedence in the academic world. > >> I'll look into it a little more, but I'd definitely be interested in >> contributing something else if that doesn't work out. > >> -Allan > >> On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > >> Hey Allan. > >> None of the references apart from the last one seems to be published in >> a peer-reviewed place, is that right? >> And "A stochastic uncoupling process for graphs" has 13 citations since >> 2000. Unless there is a more prominent >> publication or evidence of heavy use, I think it's disqualified. >> Academia is certainly not the only metric for evaluation, so if you >> have others, that's good, too ;) > >> Best, >> Andy > >> On 12/03/2016 04:33 PM, Allan Visochek wrote: > >> Hey Andy, > >> This algorithm does operate on sparse graphs so it may be beyond >> the scope of sci-kit learn, let me know what you think. >> The website is here, it includes a brief description of how the >> algorithm operates under Documentation -> Overview1 and Overview2. >> The references listed on the website are included below. > >> Best, >> -Allan > > >> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD >> thesis, University of Utrecht, May 2000. >> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > >> [2] Stijn van Dongen. A cluster algorithm for graphs. Technical >> Report INS-R0010, National Research Institute for Mathematics and >> Computer Science in the Netherlands, Amsterdam, May 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > >> [3] Stijn van Dongen. A stochastic uncoupling process for graphs. >> Technical Report INS-R0011, National Research Institute for >> Mathematics and Computer Science in the Netherlands, Amsterdam, May >> 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > >> [4] Stijn van Dongen. Performance criteria for graph clustering and >> Markov cluster experiments. Technical Report INS-R0012, National >> Research Institute for Mathematics and Computer Science in the >> Netherlands, Amsterdam, May 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > >> [5] Enright A.J., Van Dongen S., Ouzounis C.A. An efficient >> algorithm for large-scale detection of protein families, Nucleic >> Acids Research 30(7):1575-1584 (2002). > > >> On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > >> Hi Allan. >> Can you provide the original paper? >> It this something usually used on sparse graphs? We do have >> algorithms that operate on data-induced >> graphs, like SpectralClustering, but we don't really implement >> general graph algorithms (there's no PageRank or community >> detection). 
> >> Andy > > >> On 12/03/2016 12:19 PM, Allan Visochek wrote: > >> Hi there, > >> My name is Allan Visochek, I'm a data scientist and web >> developer and I love scikit-learn so first of all, thanks >> so much for the work that you do. > >> I'm reaching out because I've found the markov clustering >> algorithm to be quite useful for me in some of my work and >> noticed that there is no implementation in scikit-learn, is >> anybody working on this? If not, id be happy to take this >> on. I'm new to open source, but I've been working with >> python for a few years now. > >> Best, >> -Allan > > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ scikit-learn >> mailing list scikit-learn at python.org https://mail.python.org/ >> mailman/listinfo/scikit-learn > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Mon Dec 5 08:51:47 2016 From: t3kcit at gmail.com (Andy) Date: Mon, 5 Dec 2016 08:51:47 -0500 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: <2864ea23-e6ca-cf83-599f-f8ec149e8d67@gmail.com> On 12/04/2016 04:27 PM, Raghav R V wrote: > Hi! > > It looks like you are using the old `sklearn.cross_validation`'s > LeaveOneLabelOut cross-validator. It has been deprecated since v0.18. > > Use the `LeaveOneLabelOut` from `sklearn.model_selection`, that should > fix your issue I think (thought I have not looked into your code in > detail). > You mean LeaveOneGroupOut, right? From t3kcit at gmail.com Mon Dec 5 08:54:01 2016 From: t3kcit at gmail.com (Andy) Date: Mon, 5 Dec 2016 08:54:01 -0500 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: <6e970af0-faeb-bf81-3e9f-28dcc5df9168@gmail.com> I'm not sure what the issue with your custom CV is but this seems like a complicated way to implement this. Try model_selection.LeaveOneGroupOut, which directly implements LOSO On 12/04/2016 03:12 PM, Ludovico Coletta wrote: > Dear scikit experts, > > I'm struggling with the implementation of a nested cross validation. > > My data: I have 26 subjects (13 per class) x 6670 features. I used a > feature reduction algorithm (you may have heard about Boruta) to > reduce the dimensionality of my data. Problems start now: I defined > LOSO as outer partitioning schema. 
> Therefore, for each of the 26 cv folds I used 24 subjects for feature reduction. This led to a different number of features in each cv fold. Now, for each cv fold I would like to use the same 24 subjects for hyperparameter optimization (SVM with rbf kernel).
>
> This is what I did:
>
> cv = list(LeaveOneOut(len(y)))  # in y I stored the labels
> inner_train = [None] * len(y)
> inner_test = [None] * len(y)
> ii = 0
> while ii < len(y):
>     cv = list(LeaveOneOut(len(y)))
>     a = cv[ii][0]
>     a = a[:-1]
>     inner_train[ii] = a
>     b = cv[ii][0]
>     b = np.array(b[((len(cv[0][0]))-1)])
>     inner_test[ii] = b
>     ii = ii + 1
> custom_cv = zip(inner_train, inner_test)  # inner cv
>
> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel="rbf"))])
> parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma': np.logspace(-9, 3, 13)}]
>
> scores = [None] * (len(y))
> ii = 0
> while ii < len(scores):
>     a = data[ii][0]  # data for train
>     b = data[ii][1]  # data for test
>     c = np.concatenate((a, b))  # shape: number of subjects * number of features
>     d = cv[ii][0]  # labels for train
>     e = cv[ii][1]  # label for test
>     f = np.concatenate((d, e))
>     grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv=zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))
>     scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv=zip(([cv[ii][0]]), ([cv[ii][1]])))
>     ii = ii + 1
>
> However, I got the following error message: index 25 is out of bounds for size 25
>
> Would it be so bad if I do not perform a nested LOSO but use the default setting for hyperparameter optimization?
>
> Any help would be really appreciated
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From t3kcit at gmail.com Mon Dec 5 08:57:08 2016 From: t3kcit at gmail.com (Andy) Date: Mon, 5 Dec 2016 08:57:08 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: <20161205134552.GK2327874@phare.normalesup.org> Message-ID:

On 12/05/2016 08:51 AM, Raphael C wrote: > And... > > [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD > thesis, University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > has > > 1201 citations. > > I think it's fair to say the method is very widely known and used.

Ok cool. I haven't looked at it; my question is now whether this is more of a "graph clustering" or a "data clustering" approach, though that distinction is not very clear. Some of the papers compare it against affinity propagation, which we do have implemented.

If this algorithm makes sense for knn graphs or similar methods we implemented in SpectralClustering, then I guess go for it?

From ragvrv at gmail.com Mon Dec 5 09:19:55 2016 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 5 Dec 2016 15:19:55 +0100 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: <2864ea23-e6ca-cf83-599f-f8ec149e8d67@gmail.com> References: <2864ea23-e6ca-cf83-599f-f8ec149e8d67@gmail.com> Message-ID:

Ah yes, sorry, LeaveOneGroupOut indeed!
Also refer to this example for nested cv - http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

Thx!

On Mon, Dec 5, 2016 at 2:51 PM, Andy wrote: > On 12/04/2016 04:27 PM, Raghav R V wrote: >> Hi! >> >> It looks like you are using the old `sklearn.cross_validation`'s >> LeaveOneLabelOut cross-validator. It has been deprecated since v0.18. >> >> Use the `LeaveOneLabelOut` from `sklearn.model_selection`, that should >> fix your issue I think (though I have not looked into your code in detail). >> > You mean LeaveOneGroupOut, right? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

-- Raghav RV https://github.com/raghavrv

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From ludo25_90 at hotmail.com Mon Dec 5 09:42:47 2016 From: ludo25_90 at hotmail.com (Ludovico Coletta) Date: Mon, 5 Dec 2016 14:42:47 +0000 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID:

Thank you for the quick answer! The problem is that I have a different number of features in each cv fold, therefore I thought that I had to handle each cv fold separately. I did as you suggested, but the training set of the outer cv is then further split 3 times (stratified k-fold), which I think is suboptimal (for feature selection I did implement a nested LOSO). One question: would it be so bad if I had a nested LOSO for feature selection but the default stratified k-fold for hyperparameter optimization? It would be some kind of double dipping in the nested cv, but the final set left out for the test is not concerned.

The other point is that maybe I got something wrong in the whole process. I have 26 subjects. CV 1: subject 26 is left out for the final test, subjects 1:24 are used for hyperparameter optimization, subject 25 is used to select the best hyperparameters. CV 2: subject 1 is left out for the final test, subjects 2:25 are used for hyperparameter optimization, subject 26 is used to select the best hyperparameters. And so on until the end. Is that correct?

Sorry for the trivial questions, but I am quite a beginner with both Python and ML.

Best
Ludovico
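A minimal sketch of the nested LOSO scheme being discussed, assuming X and y are NumPy arrays with one row per sample and groups holds one subject ID per row (these variable names are illustrative, not from the original posts); each remaining subject serves once as the inner validation fold:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

logo = LeaveOneGroupOut()
pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel='rbf'))])
param_grid = {'clf__C': np.logspace(-2, 10, 13),
              'clf__gamma': np.logspace(-9, 3, 13)}

outer_scores = []
for train_idx, test_idx in logo.split(X, y, groups):
    # inner CV: leave one of the remaining subjects out in turn
    inner_cv = list(logo.split(X[train_idx], y[train_idx], groups[train_idx]))
    grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=inner_cv)
    grid.fit(X[train_idx], y[train_idx])
    # outer score: the held-out subject, never seen during tuning
    outer_scores.append(grid.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))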
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From clay at woolam.org Mon Dec 5 16:50:01 2016 From: clay at woolam.org (Clay Woolam) Date: Mon, 5 Dec 2016 13:50:01 -0800 Subject: [scikit-learn] [semi-supervised learning] Using a pre-existing graph with LabelSpreading API In-Reply-To: References: Message-ID:

Heya, sorry for not responding sooner. Running those algorithms is expensive (O(n^3) from memory), so that's going to be a big limiting factor, and I worry that your graph may be too big for these algorithms. The max_iter param is certainly available for tuning, which trades off the accuracy of the result.

Totally speculating: I don't think sparsifying would help too much with these implementations. These both create fully connected graphs as part of the graph construction step.
I think sparsification would help a lot if you instead directly simulated the particle movements through the graph, instead of using these exact solutions. For #2, what if you subclassed the LabelSpreading class and overrode _build_graph to inject the graph that you set up? May be a big hack. On Thu, Dec 1, 2016 at 7:33 PM, Delip Rao wrote: > Hello, > > I have an existing graph dataset in the edge format: > > node_i node_j weight > > The number of nodes are around 3.6M, and the number of edges are around > 72M. > > I also have some labeled data (around a dozen per class with 16 classes in > total), so overall, a perfect setting for label propagation or its > variants. In particular, I want to try the LabelSpreading implementation > for the regularization. I looked at the documentation and can't find a way > to plug in a pre-computed graph (or adjacency matrix). So two questions: > > 1. What are any scaling issues I should be aware of for a dataset of this > size? I can try sparsifying the graph, but would love to learn any knobs I > should be aware of. > 2. How do I plugin an existing weighted graph with the current API? Happy > to use any undocumented features. > > Thanks in advance! > Delip > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Dec 5 21:35:33 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 6 Dec 2016 13:35:33 +1100 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: With apologies for starting the thread and then disappearing for a while (life got in the way, and when I came back I decided the issue backlog itself was more pressing): Of late, I mostly operate on a last-in-first-out basis, so I'm highly influenced by recent activity. This minimises communication time and issues due to fading memories of contributors and reviewers. This preferences bug fixes, and is less good for long-term feature needs; but that preference also aligns with my ability to get through a small unit of work much more easily than a large conceptual PR. Sometimes I push things onto the stack that I recall I would like to see merged. From a personal perspective, having a way to remind myself of these priorities would be valuable. I'm not convinced the *Projects* kanban feature is going to readily help us, unfortunately. I don't think the problem is so much about staging issues so much as just maintaining a prioritised backlog. The Projects feature is poor with *long* lists of issues, and that's where we're at. Apart from bug fixes, I think there are *two main things that we should keep a list of*: 1. enhancements/features that a few people are agreed we'd like to see, but may get lost in the wash. 2. big ideas that need concentrated design/development work from multiple people. For example: estimator tags, sample properties, feature names, ?increased pandas support (which incidentally solves the latter two issues!), etc. What belongs in 1. is very hard to define, but it could be easily maintained through some kind of "Forget me not!" label (alternatively just "high priority"). 
Managing 2 is more about resolving some epic-scale goals as part of a release plan, and then managing a particular project to achieve design consensus, development tasking and review. While some of this can be managed with labels / kanban boards, this is mostly a procedural issue about establishing virtual sprints: we need a way to design release goals.

In terms of *Milestones*: I don't find it useful for us to assign general issues to future milestones. Where version milestones can be useful is to say "include this in a bug-fix release" or "save this for a major release" (i.e. might break backwards compatibility in a big way). These also allow us to avoid postponing releases excessively: we can scope a release 6 weeks before its proposed date, and say that as long as we can merge or eliminate nearly every issue associated with it, we should delay no longer.

The other thing that I notice is that it's not always clear who is available to review a particular contribution. GitHub allows one or more *Assignees* (must be team members) to be appointed to an issue / PR. While it is often useful to have more than two sets of eyes, using the Assignees feature may mean that each of us can better focus on a small set of issues. Upon seeing a PR that they think they will have capacity and expertise to review, core devs could assign themselves to its review, with the advantage of minimising duplicated work, while creating a sense of voluntary duty commensurate with each contributor's availability.

Apologies that I don't really have a TL;DR or any explicit proposals, but I hope you find my thoughts here useful.

Joel

On 5 December 2016 at 03:16, Raghav R V wrote: >>>> Okay so in the project, instead of sorting them by Issues / PR why don't >>>> we make one column per priority. Let's have 3 levels and one column for >>>> Done. We have a label for "Stalled" / "Need Contributor" which shows up in >>>> the cards of the project anyway... >>>> >>>> As I didn't want to disturb the existing project setup, I created one >>>> for a demo - https://github.com/scikit-learn/scikit-learn/projects/7 >>>> (I'm resending this e-mail as the last one was rejected because the >>>> attached image was huge for the mailing list) > > Thanks > > Raghav > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From linjia at ruijie.com.cn Tue Dec 6 06:12:47 2016 From: linjia at ruijie.com.cn (linjia at ruijie.com.cn) Date: Tue, 6 Dec 2016 11:12:47 +0000 Subject: [scikit-learn] question in using Scikit-learn MLPClassifier In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: <265E382B26F78742B972BE038FD292C6A9F769@fzex2.ruijie.com.cn>

Hi all,

I use the "Car Evaluation" dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data to test the effect of MLP. (I map the categorical values in the data to digits, e.g. "low" to 1, "med" to 2, "high" to 3; the final dataset's input is 6-dimensional and the output label is 4-dimensional.) However, the accuracy is not satisfactory compared to the result in Matlab, which also uses a BP algorithm. I wonder if I should tune the parameters of MLP to do better?
Attachment:

main code in Matlab: accuracy 100% after training
net = newff([-1 1;-1 1;-1 1;-1 1;-1 1;-1 1;], [10 4], {'tansig','logsig'}, 'trainlm');

main code with scikit-learn's MLPClassifier: accuracy 70% after fit
clf = MLPClassifier(solver='sgd', activation='logistic', max_iter=2000, learning_rate='adaptive', warm_start=True)

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From gael.varoquaux at normalesup.org Tue Dec 6 06:27:18 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 6 Dec 2016 12:27:18 +0100 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: <20161206112718.GD2935632@phare.normalesup.org>

On Sat, Dec 03, 2016 at 10:35:03PM +0000, federico vaggi wrote: > As long as the feature ordering has a meaningful spatial component (as is > almost always the case when you are dealing with raw pixels as features) CNNs > will almost always be better.

There is another important aspect: CNNs are meant to work on data that is translation invariant. Not all images are, for instance not brain images, because they have been realigned. That said, most images are.

G

From avisochek3 at gmail.com Tue Dec 6 08:05:21 2016 From: avisochek3 at gmail.com (Allan Visochek) Date: Tue, 6 Dec 2016 08:05:21 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: <20161205134552.GK2327874@phare.normalesup.org> Message-ID:

At its core, Markov clustering is a graph algorithm: it operates on a sparse similarity matrix (essentially, by simulating flow between the data points). This makes it useful for similarity graphs that don't originate from features (i.e. protein-protein interaction networks). Because the graph is based on similarity though, it's definitely possible to use it as a data clustering algorithm that takes a similarity metric as an argument.

I suppose it could be implemented so that the algorithm could take either a sparse similarity matrix or a set of features as its first argument. This would keep the same structure of the other clustering algorithms, but also allow use with pure similarity graphs. Does this make sense?

On Mon, Dec 5, 2016 at 8:57 AM, Andy wrote: > On 12/05/2016 08:51 AM, Raphael C wrote: >> And... >> >> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD >> thesis, University of Utrecht, May 2000. >> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm >> >> has >> >> 1201 citations. >> >> I think it's fair to say the method is very widely known and used. > > Ok cool. > I haven't looked at it, my question is now whether this is more of a > "graph clustering" or a "data clustering" approach, though that distinction is not very clear. > Some of the papers compare it against affinity propagation, which we do > have implemented. > > If this algorithm makes sense for knn graphs or similar methods we > implemented in SpectralClustering, > then I guess go for it? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part -------------- An HTML attachment was scrubbed... URL:
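A minimal dense sketch of the expansion/inflation loop that Markov clustering performs, to make the "simulating flow" description above concrete. It assumes a small, symmetric, non-negative similarity matrix; the function name, parameter values, and thresholds are illustrative only (this is not a scikit-learn API, and real MCL implementations work on sparse matrices with pruning):

import numpy as np

def markov_cluster(similarity, inflation=2.0, n_iter=100, tol=1e-6):
    # column-stochastic transition matrix with self-loops added
    M = np.asarray(similarity, dtype=float) + np.eye(len(similarity))
    M = M / M.sum(axis=0)
    for _ in range(n_iter):
        last = M.copy()
        M = M.dot(M)            # expansion: flow spreads through the graph
        M = M ** inflation      # inflation: strengthens strong flows, weakens weak ones
        M = M / M.sum(axis=0)   # renormalise columns
        if np.abs(M - last).max() < tol:
            break
    # simplified cluster read-out: each surviving (non-zero) row is an
    # attractor, and its non-zero columns are the cluster members
    clusters = []
    for row in M:
        members = tuple(np.flatnonzero(row > 1e-5))
        if members and members not in clusters:
            clusters.append(members)
    return clusters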
From t3kcit at gmail.com Tue Dec 6 09:39:51 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 6 Dec 2016 09:39:51 -0500 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: <20161206112718.GD2935632@phare.normalesup.org> References: <20161203102926.GG455403@phare.normalesup.org> <20161206112718.GD2935632@phare.normalesup.org> Message-ID: <7dbfa52e-d033-6dee-3823-830da315345b@gmail.com>

On 12/06/2016 06:27 AM, Gael Varoquaux wrote: > On Sat, Dec 03, 2016 at 10:35:03PM +0000, federico vaggi wrote: >> As long as the feature ordering has a meaningful spatial component (as is >> almost always the case when you are dealing with raw pixels as features) CNNs >> will almost always be better. > There is another important aspect: CNNs are meant to work on data that is > translation invariant. Not all images are, for instance not brain images, > because they have been realigned. That said, most images are.

They also work on images that are not globally translation invariant. Clearly you are the expert in brain images, but the point is more that the same feature detector makes sense in multiple locations. So as long as the local statistics and patterns are similar, I'd expect it to work.

From t3kcit at gmail.com Tue Dec 6 09:36:50 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 6 Dec 2016 09:36:50 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: <20161205134552.GK2327874@phare.normalesup.org> Message-ID:

On 12/06/2016 08:05 AM, Allan Visochek wrote: > At its core, Markov clustering is a graph algorithm: it operates on a > sparse similarity matrix (essentially, by simulating flow between the > data points). This makes it useful for similarity graphs that don't > originate from features (i.e. protein-protein interaction networks). > Because the graph is based on similarity though, it's definitely > possible to use it as a data clustering algorithm that takes a > similarity metric as an argument. > > I suppose it could be implemented so that the algorithm could take > either a sparse similarity matrix or a set of features as its first > argument. This would keep the same structure of the other clustering > algorithms, but also allow use with pure similarity graphs. Does this > make sense?

Yeah, that's also how the other algorithms work.

From t3kcit at gmail.com Tue Dec 6 10:19:47 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 6 Dec 2016 10:19:47 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID:

Thanks for your thoughts. I'm working in a similar mode, though I kind of try to avoid too much last-in first-out - I do it too, though, because I'm trying to keep up with all notifications. However, there are many older PRs and issues that are important bug fixes, and they get lost because of some minor new feature being added. Your point about faster communication in recent issues is taken, though.

But I feel we should prioritize bug fixes much more - they do need more brain power to review, though :-/

From ragvrv at gmail.com Tue Dec 6 11:26:32 2016 From: ragvrv at gmail.com (Raghav R V) Date: Tue, 6 Dec 2016 17:26:32 +0100 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID:

+1 for self-assigning PRs by reviewers...
On Tue, Dec 6, 2016 at 4:19 PM, Andy wrote: > Thanks for your thoughts. > I'm working in a similar mode, though I kind of try to avoid too much > last-in first-out - I do it too, though, > because I'm trying to keep up with all notifications. > However, there are many older PRs and issues that are important bug fixes > and they get lost because of some minor new feature being added. > Your point about faster communication in recent issues is taken, though. > > But I feel we should prioritize bug fixes much more - they do need more > brain power to review, though :-/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

-- Raghav RV https://github.com/raghavrv

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From se.raschka at gmail.com Tue Dec 6 11:59:18 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 6 Dec 2016 11:59:18 -0500 Subject: [scikit-learn] question in using Scikit-learn MLPClassifier In-Reply-To: <265E382B26F78742B972BE038FD292C6A9F769@fzex2.ruijie.com.cn> References: <20161203102926.GG455403@phare.normalesup.org> <265E382B26F78742B972BE038FD292C6A9F769@fzex2.ruijie.com.cn> Message-ID:

Hi, typically you want/need to play around with the hyperparameters if you want to get something useful out of an MLP - they rarely work out of the box, since hyperparameters are very context-dependent.

> However, the accuracy is not satisfactory compared to the result in Matlab, which also uses a BP algorithm. I wonder if I should tune the parameters of MLP to do better?

Things you may want to try first:

a) Check whether the training converged: i.e., check clf.loss_ for, e.g., 200, 2000, and 5000 iterations. If the loss is noticeably smaller after 5000 iterations (compared to 2000), that tells you it hasn't converged yet. Stochastic gradient descent in particular is very sensitive to the initial learning rate, so I also suggest that you try different values for it. Also, use a fixed random seed for reproducibility between runs, e.g., random_state=123.

b) If you are using stochastic gradient descent with a logistic activation function, you may want to scale your input features via the StandardScaler so that the features are centered at 0 with std. dev. 1. E.g.,

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

Good luck!
Sebastian
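A sketch combining the two suggestions above, wiring the scaler and the classifier into a single Pipeline so that the test data is transformed with the statistics learned on the training data only (X_train, y_train, X_test, y_test are placeholders for the poster's split of the Car Evaluation data, and the MLP settings are the ones from the original post plus the suggested fixed seed):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='sgd', activation='logistic', max_iter=2000,
                  learning_rate='adaptive', random_state=123))
clf.fit(X_train, y_train)

# check convergence as suggested: compare the final loss for different max_iter values
print(clf.named_steps['mlpclassifier'].loss_)
print(clf.score(X_test, y_test))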
> On Dec 6, 2016, at 6:12 AM, linjia at ruijie.com.cn wrote:
>
> Hi all,
> I use the "Car Evaluation" dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data to test the effect of MLP. (I map the categorical values in the data to digits, e.g. "low" to 1, "med" to 2, "high" to 3; the final dataset's input is 6-dimensional and the output label is 4-dimensional.) However, the accuracy is not satisfactory compared to the result in Matlab, which also uses a BP algorithm. I wonder if I should tune the parameters of MLP to do better?
>
> Attachment:
>
> main code in Matlab: accuracy 100% after training
> net = newff([-1 1;-1 1;-1 1;-1 1;-1 1;-1 1;], [10 4], {'tansig','logsig'}, 'trainlm');
>
> main code with scikit-learn's MLPClassifier: accuracy 70% after fit
> clf = MLPClassifier(solver='sgd', activation='logistic', max_iter=2000, learning_rate='adaptive', warm_start=True)
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From chinmay0301 at gmail.com Wed Dec 7 07:30:27 2016 From: chinmay0301 at gmail.com (Chinmay Talegaonkar) Date: Wed, 7 Dec 2016 18:00:27 +0530 Subject: [scikit-learn] New to scikit Message-ID:

Hi everyone, I have prior experience with Python and have recently started learning machine learning. I want to contribute to scikit-learn; can anyone suggest a relatively easy part of the codebase to explore? Thanks in advance!

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From ludo25_90 at hotmail.com Wed Dec 7 07:41:27 2016 From: ludo25_90 at hotmail.com (Ludovico Coletta) Date: Wed, 7 Dec 2016 12:41:27 +0000 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID:

Dear scikit experts,

I did as you suggested, but it is not exactly what I would like to do (I also read this: http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn). Perhaps I should ask my question in another way: is it possible to split the nested cv folds just once? It seems to me that this is not possible; do you have any hints?

Thank you for your time
Ludovico
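One way to read "split the nested cv folds just once": both GridSearchCV and cross_val_score accept an explicit list of (train, test) index pairs as their cv argument, so the folds can be materialised a single time and then reused. A sketch assuming X, y and groups as in the earlier posts (illustrative names; note that the inner indices refer to positions within the outer training subset, not the full data):

from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()

# outer folds, computed once
outer_folds = list(logo.split(X, y, groups))

# inner folds for each outer fold, also computed once;
# these indices index into X[train_idx], not into the full X
inner_folds = []
for train_idx, test_idx in outer_folds:
    inner_folds.append(list(logo.split(X[train_idx], y[train_idx],
                                       groups[train_idx])))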
5 dicembre 2016 14.54 A: scikit-learn at python.org Oggetto: scikit-learn Digest, Vol 9, Issue 15 Send scikit-learn mailing list submissions to scikit-learn at python.org To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... or, via email, send a message with subject or body 'help' to scikit-learn-request at python.org You can reach the person managing the list at scikit-learn-owner at python.org When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..." Today's Topics: 1. Re: Markov Clustering? (Gael Varoquaux) 2. Re: Markov Clustering? (Raphael C) 3. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andy) 4. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andy) ---------------------------------------------------------------------- Message: 1 Date: Mon, 5 Dec 2016 14:45:52 +0100 From: Gael Varoquaux To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Markov Clustering? Message-ID: <20161205134552.GK2327874 at phare.normalesup.org> Content-Type: text/plain; charset=iso-8859-1 Interestingly, a couple of days before this thread was started a researcher in a top lab of a huge private-sector company had mentionned to me that they found this algorithm very useful in practice (sorry for taking time to point this out, I just needed to check with him that indeed it was this specific algorithm). G On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote: > I think you get a better view of the importance of Markov Clustering in > academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q= > Markov+clustering . > Raphael > On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote: > Thanks for pointing that out, I sort of picked it up by word of mouth so > I'd assumed it had a bit more precedence in the academic world. ? > I'll look into it a little more, but I'd definitely be interested in > contributing something else if that doesn't work out. > -Allan > On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > Hey Allan. > None of the references apart from the last one seems to be published in > a peer-reviewed place, is that right? > And "A stochastic uncoupling process for graphs" has 13 citations since > 2000. Unless there is a more prominent > publication or evidence of heavy use, I think it's disqualified. > Academia is certainly not the only metric for evaluation, so if you > have others, that's good, too ;) > Best, > Andy > On 12/03/2016 04:33 PM, Allan Visochek wrote: > Hey Andy, > This algorithm does operate on sparse graphs so it may be beyond > the scope of sci-kit learn, let me know what you think.? > The website is here, it includes a brief description of how the > algorithm operates under Documentation -> Overview1 and Overview2.? > The references listed on the website are included below. > Best, > -Allan > [1]?Stijn van Dongen.?Graph Clustering by Flow Simulation. PhD > thesis, University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > [2]?Stijn van Dongen.?A cluster algorithm for graphs. Technical > Report INS-R0010, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. 
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > [3]?Stijn van Dongen.?A stochastic uncoupling process for graphs. > Technical Report INS-R0011, National Research Institute for > Mathematics and Computer Science in the Netherlands, Amsterdam, May > 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > [4]?Stijn van Dongen.?Performance criteria for graph clustering and > Markov cluster experiments. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the > Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > [5]?Enright A.J., Van Dongen S., Ouzounis C.A.?An efficient > algorithm for large-scale detection of protein families, Nucleic > Acids Research 30(7):1575-1584 (2002). > On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have > algorithms that operate on data-induced > graphs, like SpectralClustering, but we don't really implement > general graph algorithms (there's no PageRank or community > detection). > Andy > On 12/03/2016 12:19 PM, Allan Visochek wrote: > Hi there, > My name is Allan Visochek, I'm a data scientist and web > developer and I love scikit-learn so first of all, thanks > so much for the work that you do.? > I'm reaching out because I've found the markov clustering > algorithm to be quite useful for me in some of my work and > noticed that there is no implementation in scikit-learn, is > anybody working on this? If not, id be happy to take this > on. I'm new to open source, but I've been working with > python for a few years now.? > Best, > -Allan > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org https://mail.python.org/ mail.python.org Mailing Lists mail.python.org mail.python.org Mailing Lists: Welcome! Below is a listing of all the public mailing lists on mail.python.org. Click on a list name to get more information ... > mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... 
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux Gael Varoquaux (@GaelVaroquaux) | Twitter twitter.com The latest Tweets from Gael Varoquaux (@GaelVaroquaux). Researcher and geek: ?Brain, Data, & Computational science ?#python #pydata #sklearn ?Machine learning for fMRI ?Photography on @artgael. Paris, France Ga?l Varoquaux: computer / data / brain science gael-varoquaux.info Ga?l Varoquaux, computer / data / brain science ... Latest posts . misc personnal programming science Data science instrumenting social media for advertising is ... ------------------------------ Message: 2 Date: Mon, 5 Dec 2016 13:51:26 +0000 From: Raphael C To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Markov Clustering? Message-ID: Content-Type: text/plain; charset=UTF-8 And... [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm has 1201 citations. I think it's fair to say the method is very widely known and used. Raphael On 5 December 2016 at 13:45, Gael Varoquaux wrote: > Interestingly, a couple of days before this thread was started a > researcher in a top lab of a huge private-sector company had mentionned > to me that they found this algorithm very useful in practice (sorry for > taking time to point this out, I just needed to check with him that > indeed it was this specific algorithm). > > G > > On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote: >> I think you get a better view of the importance of Markov Clustering in >> academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q= >> Markov+clustering . > >> Raphael > >> On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote: > >> Thanks for pointing that out, I sort of picked it up by word of mouth so >> I'd assumed it had a bit more precedence in the academic world. > >> I'll look into it a little more, but I'd definitely be interested in >> contributing something else if that doesn't work out. > >> -Allan > >> On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > >> Hey Allan. > >> None of the references apart from the last one seems to be published in >> a peer-reviewed place, is that right? >> And "A stochastic uncoupling process for graphs" has 13 citations since >> 2000. Unless there is a more prominent >> publication or evidence of heavy use, I think it's disqualified. >> Academia is certainly not the only metric for evaluation, so if you >> have others, that's good, too ;) > >> Best, >> Andy > >> On 12/03/2016 04:33 PM, Allan Visochek wrote: > >> Hey Andy, > >> This algorithm does operate on sparse graphs so it may be beyond >> the scope of sci-kit learn, let me know what you think. >> The website is here, it includes a brief description of how the >> algorithm operates under Documentation -> Overview1 and Overview2. >> The references listed on the website are included below. > >> Best, >> -Allan > > >> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. 
PhD >> thesis, University of Utrecht, May 2000. >> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > >> [2] Stijn van Dongen. A cluster algorithm for graphs. Technical >> Report INS-R0010, National Research Institute for Mathematics and >> Computer Science in the Netherlands, Amsterdam, May 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > >> [3] Stijn van Dongen. A stochastic uncoupling process for graphs. >> Technical Report INS-R0011, National Research Institute for >> Mathematics and Computer Science in the Netherlands, Amsterdam, May >> 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > >> [4] Stijn van Dongen. Performance criteria for graph clustering and >> Markov cluster experiments. Technical Report INS-R0012, National >> Research Institute for Mathematics and Computer Science in the >> Netherlands, Amsterdam, May 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > >> [5] Enright A.J., Van Dongen S., Ouzounis C.A. An efficient >> algorithm for large-scale detection of protein families, Nucleic >> Acids Research 30(7):1575-1584 (2002). > > >> On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > >> Hi Allan. >> Can you provide the original paper? >> It this something usually used on sparse graphs? We do have >> algorithms that operate on data-induced >> graphs, like SpectralClustering, but we don't really implement >> general graph algorithms (there's no PageRank or community >> detection). > >> Andy > > >> On 12/03/2016 12:19 PM, Allan Visochek wrote: > >> Hi there, > >> My name is Allan Visochek, I'm a data scientist and web >> developer and I love scikit-learn so first of all, thanks >> so much for the work that you do. > >> I'm reaching out because I've found the markov clustering >> algorithm to be quite useful for me in some of my work and >> noticed that there is no implementation in scikit-learn, is >> anybody working on this? If not, id be happy to take this >> on. I'm new to open source, but I've been working with >> python for a few years now. > >> Best, >> -Allan > > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > >> _______________________________________________ scikit-learn >> mailing list scikit-learn at python.org https://mail.python.org/ mail.python.org Mailing Lists mail.python.org mail.python.org Mailing Lists: Welcome! Below is a listing of all the public mailing lists on mail.python.org. Click on a list name to get more information ... >> mailman/listinfo/scikit-learn > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... 
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux

------------------------------

Message: 3
Date: Mon, 5 Dec 2016 08:51:47 -0500
From: Andy
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross
        validation with scikit
Message-ID: <2864ea23-e6ca-cf83-599f-f8ec149e8d67 at gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 12/04/2016 04:27 PM, Raghav R V wrote:
> Hi!
>
> It looks like you are using the old `sklearn.cross_validation`'s
> LeaveOneLabelOut cross-validator. It has been deprecated since v0.18.
>
> Use the `LeaveOneLabelOut` from `sklearn.model_selection`, that should
> fix your issue I think (though I have not looked into your code in
> detail).
>
You mean LeaveOneGroupOut, right?

------------------------------

Message: 4
Date: Mon, 5 Dec 2016 08:54:01 -0500
From: Andy
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross
        validation with scikit
Message-ID: <6e970af0-faeb-bf81-3e9f-28dcc5df9168 at gmail.com>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"

I'm not sure what the issue with your custom CV is, but this seems like a
complicated way to implement this. Try model_selection.LeaveOneGroupOut,
which directly implements LOSO.

On 12/04/2016 03:12 PM, Ludovico Coletta wrote:
> Dear scikit experts,
>
> I'm struggling with the implementation of a nested cross validation.
>
> My data: I have 26 subjects (13 per class) x 6670 features. I used a
> feature reduction algorithm (you may have heard about Boruta) to
> reduce the dimensionality of my data. Problems start now: I defined
> LOSO as outer partitioning schema.
> Therefore, for each of the 26 cv folds I used 24 subjects for feature
> reduction. This led to a different number of features in each cv fold.
> Now, for each cv fold I would like to use the same 24 subjects for
> hyperparameter optimization (SVM with rbf kernel).
>
> This is what I did:
>
> cv = list(LeaveOneOut(len(y)))  # in y I stored the labels
>
> inner_train = [None] * len(y)
> inner_test = [None] * len(y)
>
> ii = 0
> while ii < len(y):
>     cv = list(LeaveOneOut(len(y)))
>     a = cv[ii][0]
>     a = a[:-1]
>     inner_train[ii] = a
>
>     b = cv[ii][0]
>     b = np.array(b[((len(cv[0][0]))-1)])
>     inner_test[ii] = b
>
>     ii = ii + 1
>
> custom_cv = zip(inner_train, inner_test)  # inner cv
>
> pipe_logistic = Pipeline([('scl', StandardScaler()),
>                           ('clf', SVC(kernel="rbf"))])
>
> parameters = [{'clf__C': np.logspace(-2, 10, 13),
>                'clf__gamma': np.logspace(-9, 3, 13)}]
>
> scores = [None] * (len(y))
>
> ii = 0
> while ii < len(scores):
>
>     a = data[ii][0]  # data for train
>     b = data[ii][1]  # data for test
>     c = np.concatenate((a, b))  # shape: number of subjects * number of features
>     d = cv[ii][0]  # labels for train
>     e = cv[ii][1]  # label for test
>     f = np.concatenate((d, e))
>
>     grid_search = GridSearchCV(estimator=pipe_logistic,
>         param_grid=parameters, verbose=1, scoring='accuracy',
>         cv=zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))
>
>     scores[ii] = cross_validation.cross_val_score(grid_search, c,
>         y[f], scoring='accuracy', cv=zip(([cv[ii][0]]), ([cv[ii][1]])))
>
>     ii = ii + 1
>
> However, I got the following error message: index 25 is out of bounds
> for size 25
>
> Would it be so bad if I do not perform a nested LOSO but I use the
> default setting for hyperparameter optimization?
>
> Any help would be really appreciated
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

------------------------------

Subject: Digest Footer

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

------------------------------

End of scikit-learn Digest, Vol 9, Issue 15
*******************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From siddharthgupta234 at gmail.com  Wed Dec  7 09:35:51 2016
From: siddharthgupta234 at gmail.com (Siddharth Gupta)
Date: Wed, 7 Dec 2016 20:05:51 +0530
Subject: [scikit-learn] New to scikit
In-Reply-To:
References:
Message-ID:

Great! Welcome to the community. I would suggest you check out the issues
page on the GitHub repo, raise your hand for the issues you feel you can
give a go to, and check out the issues that are tagged as needing a
contributor. Issues are a good way to start; they will direct you to the
areas of the code base to explore.
On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" wrote: > Hi everyone, > I have a prior experience in python, and have started > learning machine learning recently. I wanted to contribute to scikit, can > anyone suggest a relatively easy codebase to explore. > > Thanks in advance! > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chinmay0301 at gmail.com Wed Dec 7 09:42:49 2016 From: chinmay0301 at gmail.com (Chinmay Talegaonkar) Date: Wed, 7 Dec 2016 20:12:49 +0530 Subject: [scikit-learn] New to scikit In-Reply-To: References: Message-ID: Yeah, I found an easy bug. Looking for some help in writing deprecation cycles for a bug. On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta wrote: > Great! Welcome to the community. I would suggest you to check out the > issues page on the github repo, raise hand to the issues you feel like you > can give a go to, check out the issues that are tagged as require > contributor. Issues are a good way to start, they will direct you about the > areas of the code base to explore. > > On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" > wrote: > >> Hi everyone, >> I have a prior experience in python, and have started >> learning machine learning recently. I wanted to contribute to scikit, can >> anyone suggest a relatively easy codebase to explore. >> >> Thanks in advance! >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- -- *Chinmay Talegaonkar* Cultural and Events Coordinator, Mood Indigo .............................................. +91-8879178724 chinmay0301 at gmail.com www.moodi.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From dylkot at gmail.com Wed Dec 7 10:22:21 2016 From: dylkot at gmail.com (Dylan Kotliar) Date: Wed, 7 Dec 2016 10:22:21 -0500 Subject: [scikit-learn] Latent Dirichlet Allocation transformation of data with pre-determined topic_word distribution Message-ID: Hello, I am running Latent Dirichlet Allocation 100 times on bootstrapped versions of a dataset, gathering up the topic_word matrix from each run (components_), and merging it into a final cleaner topic_word matrix. Because I am bootstrapping documents, not every document is in every run and so it isn't clear how to get a final merged doc_topic distribution. I was wondering if there is any way to run the LatentDirichletAllocation transform method with a pre-determined components_ matrix. I tried this out in a few ways none of which worked. 
from sklearn.decomposition import LatentDirichletAllocation as skLDA mod = skLDA(n_topics=7, learning_method='batch', doc_topic_prior=.1, topic_word_prior=.1, evaluate_every=1) mod.components_ = median_beta # my collapsed estimates of this matrix topic_usage = mod.transform(word_matrix) crashes with: AttributeError: 'LatentDirichletAllocation' object has no attribute 'exp_dirichlet_component_' I try to correct this with: mod.components_ = median_beta mod.exp_dirichlet_component_ = np.exp( _dirichlet_expectation_2d(mod.components_)) mod._init_latent_vars(components_.shape[1]) and now transform will complete will run but the results don't match in the least what I would expect after looking at multiple LDA runs. Note that this kind of functionality is available for NMF where you can run: (W, H, niter) = non_negative_factorization(wordmatrix, H=median_beta, n_components=median_beta.shape[0], update_H=False) Thanks for any insight or help you can provide. Best, Dylan -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Dec 7 11:33:38 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 7 Dec 2016 11:33:38 -0500 Subject: [scikit-learn] New to scikit In-Reply-To: References: Message-ID: http://scikit-learn.org/dev/developers/contributing.html#deprecation On 12/07/2016 09:42 AM, Chinmay Talegaonkar wrote: > Yeah, I found an easy bug. Looking for some help in writing > deprecation cycles for a bug. > > On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta > > wrote: > > Great! Welcome to the community. I would suggest you to check out > the issues page on the github repo, raise hand to the issues you > feel like you can give a go to, check out the issues that are > tagged as require contributor. Issues are a good way to start, > they will direct you about the areas of the code base to explore. > > On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" > > wrote: > > Hi everyone, > I have a prior experience in python, and > have started learning machine learning recently. I wanted to > contribute to scikit, can anyone suggest a relatively easy > codebase to explore. > Thanks in advance! > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > -- > *Chinmay Talegaonkar* > Cultural and Events Coordinator, Mood Indigo > .............................................. > > > +91-8879178724 > chinmay0301 at gmail.com > www.moodi.org > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Wed Dec 7 11:33:00 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 7 Dec 2016 11:33:00 -0500 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: On 12/07/2016 07:41 AM, Ludovico Coletta wrote: > > Dear scikit experts, > > > I did as you suggested, but it is not exactly what I would like to do > ( I also read this: > http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn) > > Perhaps I should ask my question in another way: it is possible to > split the nested cv folds just once? It seems to me that this is not > possible, do you have any hints? > > Not sure I understand your question. You can do a single split by using ShuffleSplit(n_splits=1) for example. -------------- next part -------------- An HTML attachment was scrubbed... URL: From nilay.euler16 at gmail.com Wed Dec 7 11:44:18 2016 From: nilay.euler16 at gmail.com (Nilay Shrivastava) Date: Wed, 7 Dec 2016 22:14:18 +0530 Subject: [scikit-learn] return type of StandardScaler Message-ID: StandardScaler returns numpy array even if the object passed is a pandas dataframe, shouldn't it return a dataframe? -------------- next part -------------- An HTML attachment was scrubbed... URL: From bharat.didwania.eee14 at itbhu.ac.in Wed Dec 7 11:48:47 2016 From: bharat.didwania.eee14 at itbhu.ac.in (Bharat Didwania .) Date: Wed, 7 Dec 2016 08:48:47 -0800 Subject: [scikit-learn] return type of StandardScaler In-Reply-To: References: Message-ID: you can use pandas.get_dummies() . It will perform one hot encoding on categorical columns, and produce a dataframe as the result. From there you can use pandas.concat([existing_df, new_df],axis=0) to add the new columns to your existing dataframe. This will avoid the use of a numpy array. On Wed, Dec 7, 2016 at 8:44 AM, Nilay Shrivastava wrote: > StandardScaler returns numpy array even if the object passed is a pandas > dataframe, shouldn't it return a dataframe? > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Dec 7 12:06:06 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 7 Dec 2016 12:06:06 -0500 Subject: [scikit-learn] return type of StandardScaler In-Reply-To: References: Message-ID: <10cb2c89-4c53-304c-6a38-606e7935f024@gmail.com> On 12/07/2016 11:44 AM, Nilay Shrivastava wrote: > StandardScaler returns numpy array even if the object passed is a > pandas dataframe, shouldn't it return a dataframe? > See https://github.com/scikit-learn/scikit-learn/issues/5523 sklearn-pandas might be of help for now. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ludo25_90 at hotmail.com Wed Dec 7 15:13:28 2016 From: ludo25_90 at hotmail.com (Ludovico Coletta) Date: Wed, 7 Dec 2016 20:13:28 +0000 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: Thank you for the answer. I also thought about ShuffleSplit (n_splits=1), but I need to control which indices are used for training and which for testing in the nested folds. The problem is that I did feature selection before hyperparameters optimization (with a nested Leave One Out schema) and now I need the same partitioning for hyperparameters optimization. 
The reason why I did this is that the feature selection step is incredibly slow, I hope I can get rid of that step in the permutation test. Is not clear to me if I have to include feature selection in the permutation test as well. Maybe LeavePOut is what I need. Best Ludovico ________________________________ Da: scikit-learn per conto di scikit-learn-request at python.org Inviato: mercoled? 7 dicembre 2016 17.48 A: scikit-learn at python.org Oggetto: scikit-learn Digest, Vol 9, Issue 22 Send scikit-learn mailing list submissions to scikit-learn at python.org To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... or, via email, send a message with subject or body 'help' to scikit-learn-request at python.org You can reach the person managing the list at scikit-learn-owner at python.org When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..." Today's Topics: 1. Re: New to scikit (Andreas Mueller) 2. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andreas Mueller) 3. return type of StandardScaler (Nilay Shrivastava) 4. Re: return type of StandardScaler (Bharat Didwania .) ---------------------------------------------------------------------- Message: 1 Date: Wed, 7 Dec 2016 11:33:38 -0500 From: Andreas Mueller To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] New to scikit Message-ID: Content-Type: text/plain; charset="windows-1252"; Format="flowed" http://scikit-learn.org/dev/developers/contributing.html#deprecation On 12/07/2016 09:42 AM, Chinmay Talegaonkar wrote: > Yeah, I found an easy bug. Looking for some help in writing > deprecation cycles for a bug. > > On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta > > wrote: > > Great! Welcome to the community. I would suggest you to check out > the issues page on the github repo, raise hand to the issues you > feel like you can give a go to, check out the issues that are > tagged as require contributor. Issues are a good way to start, > they will direct you about the areas of the code base to explore. > > On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" > > wrote: > > Hi everyone, > I have a prior experience in python, and > have started learning machine learning recently. I wanted to > contribute to scikit, can anyone suggest a relatively easy > codebase to explore. > Thanks in advance! > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > -- > *Chinmay Talegaonkar* > Cultural and Events Coordinator, Mood Indigo > .............................................. > > > +91-8879178724 > chinmay0301 at gmail.com > www.moodi.org > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 2 Date: Wed, 7 Dec 2016 11:33:00 -0500 From: Andreas Mueller To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Content-Type: text/plain; charset="windows-1252"; Format="flowed" On 12/07/2016 07:41 AM, Ludovico Coletta wrote: > > Dear scikit experts, > > > I did as you suggested, but it is not exactly what I would like to do > ( I also read this: > http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn) > > Perhaps I should ask my question in another way: it is possible to > split the nested cv folds just once? It seems to me that this is not > possible, do you have any hints? > > Not sure I understand your question. You can do a single split by using ShuffleSplit(n_splits=1) for example. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 3 Date: Wed, 7 Dec 2016 22:14:18 +0530 From: Nilay Shrivastava To: scikit-learn at python.org Subject: [scikit-learn] return type of StandardScaler Message-ID: Content-Type: text/plain; charset="utf-8" StandardScaler returns numpy array even if the object passed is a pandas dataframe, shouldn't it return a dataframe? -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 4 Date: Wed, 7 Dec 2016 08:48:47 -0800 From: "Bharat Didwania ." To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] return type of StandardScaler Message-ID: Content-Type: text/plain; charset="utf-8" you can use pandas.get_dummies() . It will perform one hot encoding on categorical columns, and produce a dataframe as the result. From there you can use pandas.concat([existing_df, new_df],axis=0) to add the new columns to your existing dataframe. This will avoid the use of a numpy array. On Wed, Dec 7, 2016 at 8:44 AM, Nilay Shrivastava wrote: > StandardScaler returns numpy array even if the object passed is a pandas > dataframe, shouldn't it return a dataframe? > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Subject: Digest Footer _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn ------------------------------ End of scikit-learn Digest, Vol 9, Issue 22 ******************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Dec 7 16:06:29 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 7 Dec 2016 16:06:29 -0500 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: PredefinedSplit allows to define your split but I think it's better if you look at this pr: https://github.com/scikit-learn/scikit-learn/pull/7990 That allows you to cache steps in a pipeline so you don't have to recompute the feature selection for each parameter. On 12/07/2016 03:13 PM, Ludovico Coletta wrote: > > Thank you for the answer. 
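As a concrete illustration of those two options, here is a minimal sketch;
the data, labels and subject ids below are made up, only LeaveOneGroupOut
and PredefinedSplit themselves are real scikit-learn classes:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, PredefinedSplit

X = np.random.rand(6, 4)                # hypothetical data: 6 samples, 4 features
y = np.array([0, 1, 0, 1, 0, 1])        # hypothetical labels
groups = np.array([0, 0, 1, 1, 2, 2])   # hypothetical subject ids

# LeaveOneGroupOut: one subject held out per fold (i.e. LOSO)
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)

# PredefinedSplit: the same folds, fixed explicitly through test_fold
ps = PredefinedSplit(test_fold=groups)
for train_idx, test_idx in ps.split():
    print("train:", train_idx, "test:", test_idx)

The second form is what "predefined" splits means in practice: whatever
indices you computed once (for example, the ones used for the slow feature
selection) can be frozen and reused for the hyperparameter search.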
> > > I also thought about ShuffleSplit (n_splits=1), but I need to control > which indices are used for training and which for testing in the > nested folds. The problem is that I did feature selection before > hyperparameters optimization (with a nested Leave One Out schema) and > now I need the same partitioning for hyperparameters optimization. The > reason why I did this is that the feature selection step is incredibly > slow, I hope I can get rid of that step in the permutation test. Is > not clear to me if I have to include feature selection in the > permutation test as well. > > > Maybe LeavePOut is what I need. > > > Best > > Ludovico > > > > ------------------------------------------------------------------------ > *Da:* scikit-learn > per conto di > scikit-learn-request at python.org > *Inviato:* mercoled? 7 dicembre 2016 17.48 > *A:* scikit-learn at python.org > *Oggetto:* scikit-learn Digest, Vol 9, Issue 22 > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > > scikit-learn Info Page - Python > > mail.python.org > To see the collection of prior postings to the list, visit the > scikit-learn Archives. Using scikit-learn: To post a message to all > the list members ... > > > > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: New to scikit (Andreas Mueller) > 2. Re: Nested Leave One Subject Out (LOSO) cross validation with > scikit (Andreas Mueller) > 3. return type of StandardScaler (Nilay Shrivastava) > 4. Re: return type of StandardScaler (Bharat Didwania .) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 7 Dec 2016 11:33:38 -0500 > From: Andreas Mueller > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] New to scikit > Message-ID: > Content-Type: text/plain; charset="windows-1252"; Format="flowed" > > http://scikit-learn.org/dev/developers/contributing.html#deprecation > > On 12/07/2016 09:42 AM, Chinmay Talegaonkar wrote: > > Yeah, I found an easy bug. Looking for some help in writing > > deprecation cycles for a bug. > > > > On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta > > > > wrote: > > > > Great! Welcome to the community. I would suggest you to check out > > the issues page on the github repo, raise hand to the issues you > > feel like you can give a go to, check out the issues that are > > tagged as require contributor. Issues are a good way to start, > > they will direct you about the areas of the code base to explore. > > > > On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" > > > wrote: > > > > Hi everyone, > > I have a prior experience in python, and > > have started learning machine learning recently. I wanted to > > contribute to scikit, can anyone suggest a relatively easy > > codebase to explore. > > Thanks in advance! 
> > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > -- > > -- > > *Chinmay Talegaonkar* > > Cultural and Events Coordinator, Mood Indigo > > .............................................. > > > > > > +91-8879178724 > > chinmay0301 at gmail.com > > www.moodi.org > > > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > ------------------------------ > > Message: 2 > Date: Wed, 7 Dec 2016 11:33:00 -0500 > From: Andreas Mueller > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross > validation with scikit > Message-ID: > Content-Type: text/plain; charset="windows-1252"; Format="flowed" > > > > On 12/07/2016 07:41 AM, Ludovico Coletta wrote: > > > > Dear scikit experts, > > > > > > I did as you suggested, but it is not exactly what I would like to do > > ( I also read this: > > > http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn) > > > > Perhaps I should ask my question in another way: it is possible to > > split the nested cv folds just once? It seems to me that this is not > > possible, do you have any hints? > > > > > Not sure I understand your question. > You can do a single split by using ShuffleSplit(n_splits=1) for example. > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > ------------------------------ > > Message: 3 > Date: Wed, 7 Dec 2016 22:14:18 +0530 > From: Nilay Shrivastava > To: scikit-learn at python.org > Subject: [scikit-learn] return type of StandardScaler > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > StandardScaler returns numpy array even if the object passed is a pandas > dataframe, shouldn't it return a dataframe? > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > ------------------------------ > > Message: 4 > Date: Wed, 7 Dec 2016 08:48:47 -0800 > From: "Bharat Didwania ." > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] return type of StandardScaler > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > you can use pandas.get_dummies() > . > It will perform one hot encoding on categorical columns, and produce a > dataframe as the result. From there you can use > pandas.concat([existing_df, > new_df],axis=0) to add the new columns to your existing dataframe. This > will avoid the use of a numpy array. > > > On Wed, Dec 7, 2016 at 8:44 AM, Nilay Shrivastava > > wrote: > > > StandardScaler returns numpy array even if the object passed is a pandas > > dataframe, shouldn't it return a dataframe? > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 9, Issue 22 > ******************************************* > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ludo25_90 at hotmail.com Wed Dec 7 17:25:14 2016 From: ludo25_90 at hotmail.com (Ludovico Coletta) Date: Wed, 7 Dec 2016 22:25:14 +0000 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: Thank you for the good news! [?] Best Ludovico ________________________________ Da: scikit-learn per conto di scikit-learn-request at python.org Inviato: mercoled? 7 dicembre 2016 22.06 A: scikit-learn at python.org Oggetto: scikit-learn Digest, Vol 9, Issue 24 Send scikit-learn mailing list submissions to scikit-learn at python.org To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... or, via email, send a message with subject or body 'help' to scikit-learn-request at python.org You can reach the person managing the list at scikit-learn-owner at python.org When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..." Today's Topics: 1. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andreas Mueller) ---------------------------------------------------------------------- Message: 1 Date: Wed, 7 Dec 2016 16:06:29 -0500 From: Andreas Mueller To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Content-Type: text/plain; charset="windows-1252"; Format="flowed" PredefinedSplit allows to define your split but I think it's better if you look at this pr: https://github.com/scikit-learn/scikit-learn/pull/7990 [https://avatars2.githubusercontent.com/u/7454015?v=3&s=400] [WIP] Cache pipeline by glemaitre ? Pull Request #7990 ? scikit-learn/scikit-learn github.com Reference Issue Address the discussions in #3951 Other related issues and PR: #2086 #5082 #5080 What does this implement/fix? Explain your changes. It implements a version of Pipeline which allows ... That allows you to cache steps in a pipeline so you don't have to recompute the feature selection for each parameter. On 12/07/2016 03:13 PM, Ludovico Coletta wrote: > > Thank you for the answer. > > > I also thought about ShuffleSplit (n_splits=1), but I need to control > which indices are used for training and which for testing in the > nested folds. The problem is that I did feature selection before > hyperparameters optimization (with a nested Leave One Out schema) and > now I need the same partitioning for hyperparameters optimization. The > reason why I did this is that the feature selection step is incredibly > slow, I hope I can get rid of that step in the permutation test. 
Is > not clear to me if I have to include feature selection in the > permutation test as well. > > > Maybe LeavePOut is what I need. > > > Best > > Ludovico > > > > ------------------------------------------------------------------------ > *Da:* scikit-learn > per conto di > scikit-learn-request at python.org > *Inviato:* mercoled? 7 dicembre 2016 17.48 > *A:* scikit-learn at python.org > *Oggetto:* scikit-learn Digest, Vol 9, Issue 22 > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > scikit-learn Info Page - Python > scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > mail.python.org > To see the collection of prior postings to the list, visit the > scikit-learn Archives. Using scikit-learn: To post a message to all > the list members ... > > > > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: New to scikit (Andreas Mueller) > 2. Re: Nested Leave One Subject Out (LOSO) cross validation with > scikit (Andreas Mueller) > 3. return type of StandardScaler (Nilay Shrivastava) > 4. Re: return type of StandardScaler (Bharat Didwania .) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 7 Dec 2016 11:33:38 -0500 > From: Andreas Mueller > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] New to scikit > Message-ID: > Content-Type: text/plain; charset="windows-1252"; Format="flowed" > > http://scikit-learn.org/dev/developers/contributing.html#deprecation > > On 12/07/2016 09:42 AM, Chinmay Talegaonkar wrote: > > Yeah, I found an easy bug. Looking for some help in writing > > deprecation cycles for a bug. > > > > On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta > > > > wrote: > > > > Great! Welcome to the community. I would suggest you to check out > > the issues page on the github repo, raise hand to the issues you > > feel like you can give a go to, check out the issues that are > > tagged as require contributor. Issues are a good way to start, > > they will direct you about the areas of the code base to explore. > > > > On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" > > > wrote: > > > > Hi everyone, > > I have a prior experience in python, and > > have started learning machine learning recently. I wanted to > > contribute to scikit, can anyone suggest a relatively easy > > codebase to explore. > > Thanks in advance! 
> > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > -- > > -- > > *Chinmay Talegaonkar* > > Cultural and Events Coordinator, Mood Indigo > > .............................................. > > > > > > +91-8879178724 > > chinmay0301 at gmail.com > > www.moodi.org > > > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > ------------------------------ > > Message: 2 > Date: Wed, 7 Dec 2016 11:33:00 -0500 > From: Andreas Mueller > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross > validation with scikit > Message-ID: > Content-Type: text/plain; charset="windows-1252"; Format="flowed" > > > > On 12/07/2016 07:41 AM, Ludovico Coletta wrote: > > > > Dear scikit experts, > > > > > > I did as you suggested, but it is not exactly what I would like to do > > ( I also read this: > > > http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn) > > > > Perhaps I should ask my question in another way: it is possible to > > split the nested cv folds just once? It seems to me that this is not > > possible, do you have any hints? > > > > > Not sure I understand your question. > You can do a single split by using ShuffleSplit(n_splits=1) for example. > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > ------------------------------ > > Message: 3 > Date: Wed, 7 Dec 2016 22:14:18 +0530 > From: Nilay Shrivastava > To: scikit-learn at python.org > Subject: [scikit-learn] return type of StandardScaler > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > StandardScaler returns numpy array even if the object passed is a pandas > dataframe, shouldn't it return a dataframe? > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > > > ------------------------------ > > Message: 4 > Date: Wed, 7 Dec 2016 08:48:47 -0800 > From: "Bharat Didwania ." > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] return type of StandardScaler > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > you can use pandas.get_dummies() > . > It will perform one hot encoding on categorical columns, and produce a > dataframe as the result. From there you can use > pandas.concat([existing_df, > new_df],axis=0) to add the new columns to your existing dataframe. This > will avoid the use of a numpy array. > > > On Wed, Dec 7, 2016 at 8:44 AM, Nilay Shrivastava > > wrote: > > > StandardScaler returns numpy array even if the object passed is a pandas > > dataframe, shouldn't it return a dataframe? > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 9, Issue 22 > ******************************************* > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Subject: Digest Footer _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn ------------------------------ End of scikit-learn Digest, Vol 9, Issue 24 ******************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OutlookEmoji-?.png Type: image/png Size: 488 bytes Desc: OutlookEmoji-?.png URL: From tevang3 at gmail.com Wed Dec 7 18:07:35 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 8 Dec 2016 00:07:35 +0100 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible Message-ID: Greetings, I want to use the Nu-Support Vector Classifier with the following input data: X= [ array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, 1.82337731, -0.74007214, 6.75989219, 3.68538903, .................. 0. , 11.64276776, 0. , 0. ]), array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, ..................... 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, .......................... 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), ....... ] and Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ............................ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] > ?Each array of X contains 60 numbers and the dataset consists of 48 > positive and 1230 negative observations. When I train an svm.SVC() > classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep > getting the following error no matter which value of nu in [0.1, ..., 0.9, > 0.99, 0.999, 0.9999] I try: > /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, > X, y, sample_weight) > 187 > 188 seed = rnd.randint(np.iinfo('i').max) > --> 189 fit(X, y, sample_weight, solver_type, kernel, > random_seed=seed) > 190 # see comment on the other call to np.iinfo in this file > 191 > /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in > _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) > 254 cache_size=self.cache_size, coef0=self.coef0, > 255 gamma=self._gamma, epsilon=self.epsilon, > --> 256 max_iter=self.max_iter, random_seed=random_seed) > 257 > 258 self._warn_from_fit_status() > /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in > sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() > ValueError: specified nu is infeasible ? ?Does anyone know what might be wrong? 
Could it be the input data?

thanks in advance for any advice
Thomas

--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com  Wed Dec  7 18:45:25 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 00:45:25 +0100
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
Message-ID:

I tried the sklearn.neural_network.MLPClassifier with the default
parameters using the input data I quoted in my previous post about the
Nu-Support Vector Classifier. The predictions are great, but the problem
is that sometimes when I rerun the MLPClassifier it predicts no positive
observations (class 1). I have noticed that this can be controlled by the
random_state parameter, e.g. MLPClassifier(random_state=0) always gives no
positive predictions. My question is: how can I choose the right
random_state value in a real blind test case?

thanks in advance
Thomas

--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From se.raschka at gmail.com  Wed Dec  7 19:19:28 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Wed, 7 Dec 2016 19:19:28 -0500
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
In-Reply-To:
References:
Message-ID: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>

Hi, Thomas,
we had a related thread on the email list some time ago, let me post it
for reference further below. Regarding your question, I think you may want
to make sure that you standardized the features (which generally makes the
learning less sensitive to learning rate and random weight initialization).
However, even then, I would try at least 1-3 different random seeds and
look at the cost vs. time - what can happen is that you land in different
minima depending on the weight initialization, as demonstrated in the
example below (in MLPs you have the problem of a complex cost surface).

Best,
Sebastian

> The default is 100 units in the hidden layer, but theoretically, it
> should work with 2 hidden logistic units (I think that's the typical
> textbook/class example). I think what happens is that it gets stuck in
> local minima depending on the random weight initialization.
E.g., the following works just fine: > > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(solver='lbfgs', > activation='logistic', > alpha=0.0, > hidden_layer_sizes=(2,), > learning_rate_init=0.1, > max_iter=1000, > random_state=20) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > print(clf.loss_) > > > but changing the random seed to 1 leads to: > > [0 1 1 1] > 0.34660921283 > > For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: > On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis wrote: > > I tried the sklearn.neural_network.MLPClassifier with the default parameters using the input data I quoted in my previous post about Nu-Support Vector Classifier. The predictions are great but the problem is that sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter, e.g. MLPClassifier(random_state=0) gives always no positive predictions. My question is how can I chose the right random_state value in a real blind test case? > > thanks in advance > Thomas > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Unknown-1.png Type: image/png Size: 10222 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Unknown-2.png Type: image/png Size: 9601 bytes Desc: not available URL: From joel.nothman at gmail.com Wed Dec 7 20:56:30 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 8 Dec 2016 12:56:30 +1100 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: And yet GitHub just rolled out a new "reviewers" field for assigning these things... On 7 December 2016 at 03:26, Raghav R V wrote: > +1 for self assigning PRs by reviewers... > > On Tue, Dec 6, 2016 at 4:19 PM, Andy wrote: > >> Thanks for your thoughts. >> I'm working in a similar mode, though I kind of try to avoid too much >> last-in first-out - I do it too, though, >> because I'm trying to keep up with all notifications. >> However, there are many older PRs and issues that are important bug-fixes >> and they get lost because of some minor new feature being added. >> Your point about faster communication in recent issues is taken, though. 
>> >> But I feel we should prioritize bug fixes much more - they do need more >> brain power to review, though :-/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > Raghav RV > https://github.com/raghavrv > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Thu Dec 8 02:56:37 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Thu, 8 Dec 2016 07:56:37 +0000 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: Hi Thomas, the doc says, that nu gives an upper bound on the fraction of training errors and a lower bound of the fractions of support vectors. http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html Therefore, it acts as a hard bound on the allowed misclassification on your dataset. To me it seems as if the error bound is not feasible. How well did the SVC perform? What was your training error there? Will the NuSVC converge when you skip the sample_weights? Greets, Piotr On 08.12.2016 00:07, Thomas Evangelidis wrote: Greetings, I want to use the Nu-Support Vector Classifier with the following input data: X= [ array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, 1.82337731, -0.74007214, 6.75989219, 3.68538903, .................. 0. , 11.64276776, 0. , 0. ]), array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, ..................... 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, .......................... 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), ....... ] and Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ............................ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ?Each array of X contains 60 numbers and the dataset consists of 48 positive and 1230 negative observations. When I train an svm.SVC() classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep getting the following error no matter which value of nu in [0.1, ..., 0.9, 0.99, 0.999, 0.9999] I try: /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight) 187 188 seed = rnd.randint(np.iinfo('i').max) --> 189 fit(X, y, sample_weight, solver_type, kernel, random_seed=seed) 190 # see comment on the other call to np.iinfo in this file 191 /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) 254 cache_size=self.cache_size, coef0=self.coef0, 255 gamma=self._gamma, epsilon=self.epsilon, --> 256 max_iter=self.max_iter, random_seed=random_seed) 257 258 self._warn_from_fit_status() /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() ValueError: specified nu is infeasible ? ?Does anyone know what might be wrong? Could it be the input data? thanks in advance for any advice Thomas? 
-- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Thu Dec 8 03:04:40 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Thu, 8 Dec 2016 08:04:40 +0000 Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier In-Reply-To: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com> References: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com> Message-ID: Hi Thomas, Hi Thomas, besides that information of Sebastian, you dataset seems to be quite imbalances (48 positive and 1230 negative observations). You could try rebalancing your data using https://github.com/scikit-learn-contrib/imbalanced-learn This package offers some methods for resampling your data (under-sampling the majority class, over-sampling the minority class, etc.) Greets, Piotr On 08.12.2016 01:19, Sebastian Raschka wrote: Hi, Thomas, we had a related thread on the email list some time ago, let me post it for reference further below. Regarding your question, I think you may want make sure that you standardized the features (which makes the learning generally it less sensitive to learning rate and random weight initialization). However, even then, I would try at least 1-3 different random seeds and look at the cost vs time ? what can happen is that you land in different minima depending on the weight initialization as demonstrated in the example below (in MLPs you have the problem of a complex cost surface). Best, Sebastian The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='lbfgs', activation='logistic', alpha=0.0, hidden_layer_sizes=(2,), learning_rate_init=0.1, max_iter=1000, random_state=20) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) print(clf.loss_) but changing the random seed to 1 leads to: [0 1 1 1] 0.34660921283 For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: [cid:part2.01000907.08040901 at hotmail.de][cid:part3.06010207.00090501 at hotmail.de] On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis > wrote: I tried the sklearn.neural_network.MLPClassifier with the default parameters using the input data I quoted in my previous post about Nu-Support Vector Classifier. The predictions are great but the problem is that sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter, e.g. MLPClassifier(random_state=0) gives always no positive predictions. 
My question is how can I chose the right random_state value in a real blind test case? thanks in advance Thomas -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT00001.png Type: image/png Size: 10222 bytes Desc: ATT00001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT00002.png Type: image/png Size: 9601 bytes Desc: ATT00002.png URL: From tevang3 at gmail.com Thu Dec 8 04:49:51 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 8 Dec 2016 10:49:51 +0100 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: Hi Piotr, the SVC performs quite well, slightly better than random forests on the same data. By training error do you mean this? clf = svm.SVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3) print "training error=", clf.score(train_list_resampled3, train_activity_list_resampled3) If this is what you mean by "skip the sample_weights": clf = svm.NuSVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3, sample_weight=None) then no, it does not converge. After all "sample_weight=None" is the default value. I am out of ideas about what may be the problem. Thomas On 8 December 2016 at 08:56, Piotr Bialecki wrote: > Hi Thomas, > > the doc says, that nu gives an upper bound on the fraction of training > errors and a lower bound of the fractions > of support vectors. > http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html > > Therefore, it acts as a hard bound on the allowed misclassification on > your dataset. > > To me it seems as if the error bound is not feasible. > How well did the SVC perform? What was your training error there? > > Will the NuSVC converge when you skip the sample_weights? > > > Greets, > Piotr > > > On 08.12.2016 00:07, Thomas Evangelidis wrote: > > Greetings, > > I want to use the Nu-Support Vector Classifier with the following input > data: > > X= [ > array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, > 1.82337731, -0.74007214, 6.75989219, 3.68538903, > .................. > 0. , 11.64276776, 0. , 0. ]), > array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, > 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, > ..................... > 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), > array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, > 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, > .......................... > 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), > ....... 
> ] > > and > > Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, > 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, ............................ > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0] > > >> ?Each array of X contains 60 numbers and the dataset consists of 48 >> positive and 1230 negative observations. When I train an svm.SVC() >> classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep >> getting the following error no matter which value of nu in [0.1, ..., 0.9, >> 0.99, 0.999, 0.9999] I try: >> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, >> X, y, sample_weight) >> 187 >> 188 seed = rnd.randint(np.iinfo('i').max) >> --> 189 fit(X, y, sample_weight, solver_type, kernel, >> random_seed=seed) >> 190 # see comment on the other call to np.iinfo in this file >> 191 >> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in >> _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) >> 254 cache_size=self.cache_size, coef0=self.coef0, >> 255 gamma=self._gamma, epsilon=self.epsilon, >> --> 256 max_iter=self.max_iter, random_seed=random_seed) >> 257 >> 258 self._warn_from_fit_status() >> /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in >> sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() >> ValueError: specified nu is infeasible > > > ? > ?Does anyone know what might be wrong? Could it be the input data? > > thanks in advance for any advice > Thomas? > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Thu Dec 8 04:57:10 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Thu, 8 Dec 2016 10:57:10 +0100 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: You have to set a bigger \nu. Try nus =2 ** np.arange(-1, 10) # starting at .5 (default), going to 512 for nu in nus: clf = svm.NuSVC(nu=nu) try: clf.fit ... except ValueError as e: print("nu {} not feasible".format(nu)) At some point it should start working. Hope that helps, Michael On Thu, Dec 8, 2016 at 10:49 AM, Thomas Evangelidis wrote: > Hi Piotr, > > the SVC performs quite well, slightly better than random forests on the > same data. By training error do you mean this? 
> > clf = svm.SVC(probability=True) > clf.fit(train_list_resampled3, train_activity_list_resampled3) > print "training error=", clf.score(train_list_resampled3, > train_activity_list_resampled3) > > If this is what you mean by "skip the sample_weights": > clf = svm.NuSVC(probability=True) > clf.fit(train_list_resampled3, train_activity_list_resampled3, > sample_weight=None) > > then no, it does not converge. After all "sample_weight=None" is the > default value. > > I am out of ideas about what may be the problem. > > Thomas > > > On 8 December 2016 at 08:56, Piotr Bialecki > wrote: > >> Hi Thomas, >> >> the doc says, that nu gives an upper bound on the fraction of training >> errors and a lower bound of the fractions >> of support vectors. >> http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html >> >> Therefore, it acts as a hard bound on the allowed misclassification on >> your dataset. >> >> To me it seems as if the error bound is not feasible. >> How well did the SVC perform? What was your training error there? >> >> Will the NuSVC converge when you skip the sample_weights? >> >> >> Greets, >> Piotr >> >> >> On 08.12.2016 00:07, Thomas Evangelidis wrote: >> >> Greetings, >> >> I want to use the Nu-Support Vector Classifier with the following input >> data: >> >> X= [ >> array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, >> 1.82337731, -0.74007214, 6.75989219, 3.68538903, >> .................. >> 0. , 11.64276776, 0. , 0. ]), >> array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, >> 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, >> ..................... >> 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), >> array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, >> 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, >> .......................... >> 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), >> ....... >> ] >> >> and >> >> Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, >> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, >> 0, 0, 0, ............................ >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, >> 0, 0, 0, 0, 0, 0, 0, 0] >> >> >>> ?Each array of X contains 60 numbers and the dataset consists of 48 >>> positive and 1230 negative observations. When I train an svm.SVC() >>> classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep >>> getting the following error no matter which value of nu in [0.1, ..., 0.9, >>> 0.99, 0.999, 0.9999] I try: >>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in >>> fit(self, X, y, sample_weight) >>> 187 >>> 188 seed = rnd.randint(np.iinfo('i').max) >>> --> 189 fit(X, y, sample_weight, solver_type, kernel, >>> random_seed=seed) >>> 190 # see comment on the other call to np.iinfo in this file >>> 191 >>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in >>> _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) >>> 254 cache_size=self.cache_size, coef0=self.coef0, >>> 255 gamma=self._gamma, epsilon=self.epsilon, >>> --> 256 max_iter=self.max_iter, random_seed=random_seed) >>> 257 >>> 258 self._warn_from_fit_status() >>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in >>> sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() >>> ValueError: specified nu is infeasible >> >> >> ? >> ?Does anyone know what might be wrong? Could it be the input data? 
>> >> thanks in advance for any advice >> Thomas? >> >> >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Thu Dec 8 05:08:23 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Thu, 8 Dec 2016 10:08:23 +0000 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: Hi Michael, hi Thomas, I think the nu value is bound to (0, 1]. So the code will result in a ValueError (at least in sklearn 0.18). @Thomas I still think the optimization problem is not feasible due to your data. Have you tried balancing the dataset as I mentioned in your other question regarding the MLPClassifier? Greets, Piotr On 08.12.2016 10:57, Michael Eickenberg wrote: You have to set a bigger \nu. Try nus =2 ** np.arange(-1, 10) # starting at .5 (default), going to 512 for nu in nus: clf = svm.NuSVC(nu=nu) try: clf.fit ... except ValueError as e: print("nu {} not feasible".format(nu)) At some point it should start working. Hope that helps, Michael On Thu, Dec 8, 2016 at 10:49 AM, Thomas Evangelidis > wrote: Hi Piotr, the SVC performs quite well, slightly better than random forests on the same data. By training error do you mean this? clf = svm.SVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3) print "training error=", clf.score(train_list_resampled3, train_activity_list_resampled3) If this is what you mean by "skip the sample_weights": clf = svm.NuSVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3, sample_weight=None) then no, it does not converge. After all "sample_weight=None" is the default value. I am out of ideas about what may be the problem. Thomas On 8 December 2016 at 08:56, Piotr Bialecki <piotr.bialecki at hotmail.de> wrote: Hi Thomas, the doc says, that nu gives an upper bound on the fraction of training errors and a lower bound of the fractions of support vectors. http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html Therefore, it acts as a hard bound on the allowed misclassification on your dataset. To me it seems as if the error bound is not feasible. How well did the SVC perform? 
What was your training error there? Will the NuSVC converge when you skip the sample_weights? Greets, Piotr On 08.12.2016 00:07, Thomas Evangelidis wrote: Greetings, I want to use the Nu-Support Vector Classifier with the following input data: X= [ array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, 1.82337731, -0.74007214, 6.75989219, 3.68538903, .................. 0. , 11.64276776, 0. , 0. ]), array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, ..................... 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, .......................... 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), ....... ] and Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ............................ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ?Each array of X contains 60 numbers and the dataset consists of 48 positive and 1230 negative observations. When I train an svm.SVC() classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep getting the following error no matter which value of nu in [0.1, ..., 0.9, 0.99, 0.999, 0.9999] I try: /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight) 187 188 seed = rnd.randint(np.iinfo('i').max) --> 189 fit(X, y, sample_weight, solver_type, kernel, random_seed=seed) 190 # see comment on the other call to np.iinfo in this file 191 /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) 254 cache_size=self.cache_size, coef0=self.coef0, 255 gamma=self._gamma, epsilon=self.epsilon, --> 256 max_iter=self.max_iter, random_seed=random_seed) 257 258 self._warn_from_fit_status() /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() ValueError: specified nu is infeasible ? ? Does anyone know what might be wrong? Could it be the input data? thanks in advance for any advice Thomas ? 
-- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Thu Dec 8 05:18:17 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Thu, 8 Dec 2016 11:18:17 +0100 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: Ah, sorry, true. It is the error fraction instead of the number of errors. In any case, try varying this quantity. At one point I thought that nuSVC is just the constrained optimization version of the lagrange-style (penalized) normal SVC. That would mean that there is a correspondence between C for SVC and nu for nuSVC, leading to the conclusion that there must be nus that are feasible. So setting to nu=1. should always lead to feasibility. Now, looking at the docstring, since the nu controls two quantities at the same time, I am not entirely 1000% sure of this anymore, but I think it still holds. Michael On Thu, Dec 8, 2016 at 11:08 AM, Piotr Bialecki wrote: > Hi Michael, hi Thomas, > > I think the nu value is bound to (0, 1]. > So the code will result in a ValueError (at least in sklearn 0.18). > > @Thomas > I still think the optimization problem is not feasible due to your data. > Have you tried balancing the dataset as I mentioned in your other question > regarding the MLPClassifier? > > > Greets, > Piotr > > > > > > > On 08.12.2016 10:57, Michael Eickenberg wrote: > > You have to set a bigger \nu. > Try > > nus =2 ** np.arange(-1, 10) # starting at .5 (default), going to 512 > for nu in nus: > clf = svm.NuSVC(nu=nu) > try: > clf.fit ... > except ValueError as e: > print("nu {} not feasible".format(nu)) > > At some point it should start working. > > Hope that helps, > Michael > > > > > On Thu, Dec 8, 2016 at 10:49 AM, Thomas Evangelidis > wrote: > >> Hi Piotr, >> >> the SVC performs quite well, slightly better than random forests on the >> same data. By training error do you mean this? 
>> >> clf = svm.SVC(probability=True) >> clf.fit(train_list_resampled3, train_activity_list_resampled3) >> print "training error=", clf.score(train_list_resampled3, >> train_activity_list_resampled3) >> >> If this is what you mean by "skip the sample_weights": >> clf = svm.NuSVC(probability=True) >> clf.fit(train_list_resampled3, train_activity_list_resampled3, >> sample_weight=None) >> >> then no, it does not converge. After all "sample_weight=None" is the >> default value. >> >> I am out of ideas about what may be the problem. >> >> Thomas >> >> >> On 8 December 2016 at 08:56, Piotr Bialecki < >> piotr.bialecki at hotmail.de> wrote: >> >>> Hi Thomas, >>> >>> the doc says, that nu gives an upper bound on the fraction of training >>> errors and a lower bound of the fractions >>> of support vectors. >>> http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html >>> >>> Therefore, it acts as a hard bound on the allowed misclassification on >>> your dataset. >>> >>> To me it seems as if the error bound is not feasible. >>> How well did the SVC perform? What was your training error there? >>> >>> Will the NuSVC converge when you skip the sample_weights? >>> >>> >>> Greets, >>> Piotr >>> >>> >>> On 08.12.2016 00:07, Thomas Evangelidis wrote: >>> >>> Greetings, >>> >>> I want to use the Nu-Support Vector Classifier with the following >>> input data: >>> >>> X= [ >>> array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, >>> 1.82337731, -0.74007214, 6.75989219, 3.68538903, >>> .................. >>> 0. , 11.64276776, 0. , 0. ]), >>> array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, >>> 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, >>> ..................... >>> 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), >>> array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, >>> 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, >>> .......................... >>> 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), >>> ....... >>> ] >>> >>> and >>> >>> Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, >>> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, >>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, >>> 0, 0, 0, 0, ............................ >>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, >>> 0, 0, 0, 0, 0, 0, 0, 0] >>> >>> >>>> ?Each array of X contains 60 numbers and the dataset consists of 48 >>>> positive and 1230 negative observations. 
When I train an svm.SVC() >>>> classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep >>>> getting the following error no matter which value of nu in [0.1, ..., 0.9, >>>> 0.99, 0.999, 0.9999] I try: >>>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in >>>> fit(self, X, y, sample_weight) >>>> 187 >>>> 188 seed = rnd.randint(np.iinfo('i').max) >>>> --> 189 fit(X, y, sample_weight, solver_type, kernel, >>>> random_seed=seed) >>>> 190 # see comment on the other call to np.iinfo in this file >>>> 191 >>>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in >>>> _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) >>>> 254 cache_size=self.cache_size, coef0=self.coef0, >>>> 255 gamma=self._gamma, epsilon=self.epsilon, >>>> --> 256 max_iter=self.max_iter, random_seed=random_seed) >>>> 257 >>>> 258 self._warn_from_fit_status() >>>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in >>>> sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() >>>> ValueError: specified nu is infeasible >>> >>> >>> ? >>> ? Does anyone know what might be wrong? Could it be the input data? >>> >>> thanks in advance for any advice >>> Thomas ? >>> >>> >>> >>> -- >>> >>> ====================================================================== >>> >>> Thomas Evangelidis >>> >>> Research Specialist >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/1S081, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: >>> https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ scikit-learn mailing >>> list scikit-learn at python.org https://mail.python.org/mailma >>> n/listinfo/scikit-learn >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> CEITEC - Central European Institute of Technology Masaryk University >> Kamenice 5/A35/1S081, 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> _______________________________________________ scikit-learn mailing >> list scikit-learn at python.org https://mail.python.org/mailma >> n/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Thu Dec 8 05:24:06 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Thu, 8 Dec 2016 10:24:06 +0000 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: I thought the same about the correpondence between SVC and nuSVC. Any ideas why lowering the value might also help? http://stackoverflow.com/questions/35221433/error-in-using-non-linear-svm-in-scikit-learn He apparently used a very low value for nu (0.01) and the error vanished. 
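A likely explanation: libsvm's parameter check requires, for every pair of classes, nu * (n_i + n_j) / 2 <= min(n_i, n_j), so for a binary problem nu cannot exceed 2 * min(n_pos, n_neg) / n_samples. With 48 positives and 1230 negatives that cap is about 0.075, which is why 0.1 and larger values are rejected as infeasible while 0.01 goes through. The bound can be checked directly from the labels; a small sketch using the class counts quoted above:

import numpy as np

y = np.array([1] * 48 + [0] * 1230)      # 48 actives, 1230 inactives, as in the thread
n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
nu_max = 2.0 * min(n_pos, n_neg) / len(y)
print(nu_max)                            # ~0.075; any nu above this triggers "specified nu is infeasible"

Any nu comfortably below that value should at least be accepted by the solver; whether the resulting model generalises well is a separate question.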
Greets, Piotr On 08.12.2016 11:18, Michael Eickenberg wrote: Ah, sorry, true. It is the error fraction instead of the number of errors. In any case, try varying this quantity. At one point I thought that nuSVC is just the constrained optimization version of the lagrange-style (penalized) normal SVC. That would mean that there is a correspondence between C for SVC and nu for nuSVC, leading to the conclusion that there must be nus that are feasible. So setting to nu=1. should always lead to feasibility. Now, looking at the docstring, since the nu controls two quantities at the same time, I am not entirely 1000% sure of this anymore, but I think it still holds. Michael On Thu, Dec 8, 2016 at 11:08 AM, Piotr Bialecki > wrote: Hi Michael, hi Thomas, I think the nu value is bound to (0, 1]. So the code will result in a ValueError (at least in sklearn 0.18). @Thomas I still think the optimization problem is not feasible due to your data. Have you tried balancing the dataset as I mentioned in your other question regarding the MLPClassifier? Greets, Piotr On 08.12.2016 10:57, Michael Eickenberg wrote: You have to set a bigger \nu. Try nus =2 ** np.arange(-1, 10) # starting at .5 (default), going to 512 for nu in nus: clf = svm.NuSVC(nu=nu) try: clf.fit ... except ValueError as e: print("nu {} not feasible".format(nu)) At some point it should start working. Hope that helps, Michael On Thu, Dec 8, 2016 at 10:49 AM, Thomas Evangelidis > wrote: Hi Piotr, the SVC performs quite well, slightly better than random forests on the same data. By training error do you mean this? clf = svm.SVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3) print "training error=", clf.score(train_list_resampled3, train_activity_list_resampled3) If this is what you mean by "skip the sample_weights": clf = svm.NuSVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3, sample_weight=None) then no, it does not converge. After all "sample_weight=None" is the default value. I am out of ideas about what may be the problem. Thomas On 8 December 2016 at 08:56, Piotr Bialecki > wrote: Hi Thomas, the doc says, that nu gives an upper bound on the fraction of training errors and a lower bound of the fractions of support vectors. http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html Therefore, it acts as a hard bound on the allowed misclassification on your dataset. To me it seems as if the error bound is not feasible. How well did the SVC perform? What was your training error there? Will the NuSVC converge when you skip the sample_weights? Greets, Piotr On 08.12.2016 00:07, Thomas Evangelidis wrote: Greetings, I want to use the Nu-Support Vector Classifier with the following input data: X= [ array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, 1.82337731, -0.74007214, 6.75989219, 3.68538903, .................. 0. , 11.64276776, 0. , 0. ]), array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, ..................... 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, .......................... 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), ....... 
] and Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ............................ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ?Each array of X contains 60 numbers and the dataset consists of 48 positive and 1230 negative observations. When I train an svm.SVC() classifier I get quite good predictions, but wit the ?svm.NuSVC?() I keep getting the following error no matter which value of nu in [0.1, ..., 0.9, 0.99, 0.999, 0.9999] I try: /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight) 187 188 seed = rnd.randint(np.iinfo('i').max) --> 189 fit(X, y, sample_weight, solver_type, kernel, random_seed=seed) 190 # see comment on the other call to np.iinfo in this file 191 /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) 254 cache_size=self.cache_size, coef0=self.coef0, 255 gamma=self._gamma, epsilon=self.epsilon, --> 256 max_iter=self.max_iter, random_seed=random_seed) 257 258 self._warn_from_fit_status() /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() ValueError: specified nu is infeasible ? ? Does anyone know what might be wrong? Could it be the input data? thanks in advance for any advice Thomas ? -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tevang3 at gmail.com Thu Dec 8 06:57:30 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 8 Dec 2016 12:57:30 +0100 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: > > > @Thomas > I still think the optimization problem is not feasible due to your data. > Have you tried balancing the dataset as I mentioned in your other question > regarding the > ?? > MLPClassifier? > > > ?Hi Piotr, I had tried all the balancing algorithms in the link that you stated, but the only one that really offered some improvement was the SMOTE over-sampling of positive observations. The original dataset contained ?24 positive and 1230 negative but after SMOTE I doubled the positive to 48. Reduction of the negative observations led to poor predictions, at least using random forests. I haven't tried it with ? MLPClassifier yet though. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Thu Dec 8 09:55:24 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 8 Dec 2016 15:55:24 +0100 Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier In-Reply-To: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com> References: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com> Message-ID: Hello Sebastian, I did normalization of my training set and used the same mean and stdev values to normalize my test set, instead of calculating means and stdev from the test set. I did that because my training set size is finite and the value of each feature is a descriptor that is characteristic of the 3D shape of the observation. The test set would definitely have different mean and stdev values from the training set, and if I had used them to normalize it then I believe I would have distorted the original descriptor values. Anyway, after this normalization I don't get 0 positive predictions anymore by the MLPClassifier. I still don't understand your second suggestion. I cannot find any parameter to control the epoch or measure the cost in sklearn .neural_network.MLPClassifier. Do you suggest to use your own classes from github instead? Besides that my goal is not to make one MLPClassifier using a specific training set, but rather to write a program that can take as input various training sets each time and and train a neural network that will classify a given test set. Therefore, unless I didn't understand your points, working with 3 arbitrary random_state values on my current training set in order to find one value to yield good predictions, wont solve my problem. best Thomas On 8 December 2016 at 01:19, Sebastian Raschka wrote: > Hi, Thomas, > we had a related thread on the email list some time ago, let me post it > for reference further below. Regarding your question, I think you may want > make sure that you standardized the features (which makes the learning > generally it less sensitive to learning rate and random weight > initialization). However, even then, I would try at least 1-3 different > random seeds and look at the cost vs time ? what can happen is that you > land in different minima depending on the weight initialization as > demonstrated in the example below (in MLPs you have the problem of a > complex cost surface). > > Best, > Sebastian > > The default is set 100 units in the hidden layer, but theoretically, it > should work with 2 hidden logistic units (I think that?s the typical > textbook/class example). 
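To tie the suggestion about trying several seeds back to the earlier question of how to pick random_state before a blind test: one option is to treat the seed as just another hyperparameter and select it by cross-validation on the training data alone, then refit once before predicting on the blind set. A rough sketch along those lines (X and y stand for the standardized training arrays described above; the seed range and the f1 scoring are only illustrative choices):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for seed in range(5):                    # a handful of seeds, as suggested
    clf = MLPClassifier(max_iter=1000, random_state=seed)
    scores[seed] = cross_val_score(clf, X, y, cv=cv, scoring='f1').mean()

best_seed = max(scores, key=scores.get)
final_clf = MLPClassifier(max_iter=1000, random_state=best_seed).fit(X, y)
print(best_seed, scores[best_seed], final_clf.loss_)   # loss_ is the training cost at the last iteration

Because the selection uses only training data, the same recipe carries over to any new training set without having to hand-pick a seed.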
I think what happens is that it gets stuck in > local minima depending on the random weight initialization. E.g., the > following works just fine: > > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(solver='lbfgs', > activation='logistic', > alpha=0.0, > hidden_layer_sizes=(2,), > learning_rate_init=0.1, > max_iter=1000, > random_state=20) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > print(clf.loss_) > > > but changing the random seed to 1 leads to: > > [0 1 1 1] > 0.34660921283 > > For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and > logistic activation as well; https://github.com/ > rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), > essentially resulting in the same problem: > > > > > > > > > > > > > > > > > On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis wrote: > > I tried the sklearn.neural_network.MLPClassifier with the default > parameters using the input data I quoted in my previous post about > Nu-Support Vector Classifier. The predictions are great but the problem is > that sometimes when I rerun the MLPClassifier it predicts no positive > observations (class 1). I have noticed that this can be controlled by the > random_state parameter, e.g. MLPClassifier(random_state=0) gives always no > positive predictions. My question is how can I chose the right random_state > value in a real blind test case? > > thanks in advance > Thomas > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Unknown-2.png Type: image/png Size: 9601 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Unknown-1.png Type: image/png Size: 10222 bytes Desc: not available URL: From se.raschka at gmail.com Thu Dec 8 10:21:42 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 8 Dec 2016 10:21:42 -0500 Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier In-Reply-To: References: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com> Message-ID: <0ECE8375-4739-4A68-A2A1-EAFA6C1DB208@gmail.com> > Besides that my goal is not to make one MLPClassifier using a specific training set, but rather to write a program that can take as input various training sets each time and and train a neural network that will classify a given test set. Therefore, unless I didn't understand your points, working with 3 arbitrary random_state values on my current training set in order to find one value to yield good predictions, wont solve my problem. Unfortunately, there?s no silver regarding default hyperparameter values that works across all training sets. Here, random state may also be considered a hyperparam, since you don?t have a convex cost function in MLP, it may or may not get stuck in different local minima depending on your random weight initialization. > I cannot find any parameter to control the epoch You can control the maximum number of iterations via the max_iter parameter. I don?t know though whether one iteration is equal to one epoch (pass over the training set) for minibatch training in this particular implementation. > or measure the cost in sklearn.neural_network.MLPClassifier The cost of the last iteration is available via the loss_ attribute: mlp = MLPClassifier(?) # after training: mlp.loss_ > On Dec 8, 2016, at 9:55 AM, Thomas Evangelidis wrote: > > Hello Sebastian, > > I did normalization of my training set and used the same mean and stdev values to normalize my test set, instead of calculating means and stdev from the test set. I did that because my training set size is finite and the value of each feature is a descriptor that is characteristic of the 3D shape of the observation. The test set would definitely have different mean and stdev values from the training set, and if I had used them to normalize it then I believe I would have distorted the original descriptor values. Anyway, after this normalization I don't get 0 positive predictions anymore by the MLPClassifier. > > I still don't understand your second suggestion. I cannot find any parameter to control the epoch or measure the cost in sklearn.neural_network.MLPClassifier. Do you suggest to use your own classes from github instead? > Besides that my goal is not to make one MLPClassifier using a specific training set, but rather to write a program that can take as input various training sets each time and and train a neural network that will classify a given test set. Therefore, unless I didn't understand your points, working with 3 arbitrary random_state values on my current training set in order to find one value to yield good predictions, wont solve my problem. > > best > Thomas > > > > On 8 December 2016 at 01:19, Sebastian Raschka wrote: > Hi, Thomas, > we had a related thread on the email list some time ago, let me post it for reference further below. Regarding your question, I think you may want make sure that you standardized the features (which makes the learning generally it less sensitive to learning rate and random weight initialization). However, even then, I would try at least 1-3 different random seeds and look at the cost vs time ? 
what can happen is that you land in different minima depending on the weight initialization as demonstrated in the example below (in MLPs you have the problem of a complex cost surface). > > Best, > Sebastian > >> The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: >> >> from sklearn.neural_network import MLPClassifier >> X = [[0, 0], [0, 1], [1, 0], [1, 1]] >> y = [0, 1, 1, 0] >> clf = MLPClassifier(solver='lbfgs', >> activation='logistic', >> alpha=0.0, >> hidden_layer_sizes=(2,), >> learning_rate_init=0.1, >> max_iter=1000, >> random_state=20) >> clf.fit(X, y) >> res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) >> print(res) >> print(clf.loss_) >> >> >> but changing the random seed to 1 leads to: >> >> [0 1 1 1] >> 0.34660921283 >> >> For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: >> > > > > > > > > > > > > > > > >> On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis wrote: >> >> I tried the sklearn.neural_network.MLPClassifier with the default parameters using the input data I quoted in my previous post about Nu-Support Vector Classifier. The predictions are great but the problem is that sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter, e.g. MLPClassifier(random_state=0) gives always no positive predictions. My question is how can I chose the right random_state value in a real blind test case? >> >> thanks in advance >> Thomas >> >> >> -- >> ====================================================================== >> Thomas Evangelidis >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> tevang3 at gmail.com >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Thu Dec 8 10:59:14 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 8 Dec 2016 16:59:14 +0100 Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible In-Reply-To: References: Message-ID: It finally works with nu=0.01 or less and the predictions are good. Is there a problem with that? 
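On whether relying on nu=0.01 is a problem: probably not in itself. As discussed earlier in the thread, with 48 positives out of 1278 samples the solver only accepts nu below roughly 0.075 anyway, and nu merely lower-bounds the fraction of support vectors and upper-bounds the fraction of margin errors. The more informative check is whether the model generalises under this class imbalance, which plain accuracy will not reveal. A hedged sketch (X and y as in the original post; the metrics are suggestions, not the only sensible choices):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import NuSVC

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = NuSVC(nu=0.01)
print(cross_val_score(clf, X, y, cv=cv, scoring='roc_auc').mean())            # ranking quality
print(cross_val_score(clf, X, y, cv=cv, scoring='average_precision').mean())  # focuses on the rare positives

If these scores hold up across folds, the small nu is nothing to worry about.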
On 8 December 2016 at 12:57, Thomas Evangelidis wrote: > > >> >> @Thomas >> I still think the optimization problem is not feasible due to your data. >> Have you tried balancing the dataset as I mentioned in your other >> question regarding the >> ?? >> MLPClassifier? >> >> >> > ?Hi Piotr, > > I had tried all the balancing algorithms in the link that you stated, but > the only one that really offered some improvement was the SMOTE > over-sampling of positive observations. The original dataset contained ?24 > positive and 1230 negative but after SMOTE I doubled the positive to 48. > Reduction of the negative observations led to poor predictions, at least > using random forests. I haven't tried it with > ? > MLPClassifier yet though. > > > > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Dec 8 11:41:41 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 8 Dec 2016 11:41:41 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: <2994d7f7-e37c-46e7-d951-0e5185642c4c@gmail.com> On 12/07/2016 08:56 PM, Joel Nothman wrote: > And yet GitHub just rolled out a new "reviewers" field for assigning > these things... > Yeah I'm not sure what the difference is from assignment. I guess they thought as "assignment" for issues and reviewers for PRs? But assignments also exist for PRs.. There's also a new automatic list of who's active on a PR. From mailfordebu at gmail.com Fri Dec 9 02:16:11 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Fri, 9 Dec 2016 12:46:11 +0530 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError Message-ID: Hi All, Greetings ! I am getting JoblibMemoryError while executing a scikit-learn RandomForestClassifier code. Here is my algorithm in short: from sklearn.ensemble import RandomForestClassifier from sklearn.cross_validation import train_test_split import pandas as pd import numpy as np clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) clf.fit(p_input_features_train,p_input_labels_train) The dataframe p_input_features contain 134 columns (features) and 5 million rows (observations). The exact *error message* is given below: Executing Random Forest Classifier Traceback (most recent call last): File "/home/user/rf_fold.py", line 43, in clf.fit(p_features_train,p_labels_train) File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 290, in fit for i, t in enumerate(trees)) File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 810, in __call__ self.retrieve() File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757, in retrieve raise exception sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError ___________________________________________________________________________ Multiprocessing exception: ........................................................................... 
/var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0, warm_start=False), X=array([[ 0. , 0. , 0. , .... 0. , 0. ]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 0.], [ 0.], [ 0.]]), sample_weight=None) 285 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, 286 backend="threading")( 287 delayed(_parallel_build_trees)( 288 t, self, X, y, sample_weight, i, len(trees), 289 verbose=self.verbose, class_weight=self.class_weight) --> 290 for i, t in enumerate(trees)) i = 4999 291 292 # Collect newly grown trees 293 self.estimators_.extend(trees) 294 ........................................................................... Please can you help me to identify a possible resolution to this. Thanks, Debu -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Fri Dec 9 04:18:05 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Fri, 9 Dec 2016 09:18:05 +0000 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Hi Debu, it seems that you run out of memory. Try using fewer processes. I don't think that n_jobs = 1000 will perform as you wish. Setting n_jobs to -1 uses the number of cores in your system. Greets, Piotr On 09.12.2016 08:16, Debabrata Ghosh wrote: Hi All, Greetings ! I am getting JoblibMemoryError while executing a scikit-learn RandomForestClassifier code. Here is my algorithm in short: from sklearn.ensemble import RandomForestClassifier from sklearn.cross_validation import train_test_split import pandas as pd import numpy as np clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) clf.fit(p_input_features_train,p_input_labels_train) The dataframe p_input_features contain 134 columns (features) and 5 million rows (observations). The exact error message is given below: Executing Random Forest Classifier Traceback (most recent call last): File "/home/user/rf_fold.py", line 43, in clf.fit(p_features_train,p_labels_train) File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 290, in fit for i, t in enumerate(trees)) File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 810, in __call__ self.retrieve() File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757, in retrieve raise exception sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError ___________________________________________________________________________ Multiprocessing exception: ........................................................................... /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0, warm_start=False), X=array([[ 0. , 0. , 0. , .... 0. , 0. ]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 0.], [ 0.], [ 0.]]), sample_weight=None) 285 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, 286 backend="threading")( 287 delayed(_parallel_build_trees)( 288 t, self, X, y, sample_weight, i, len(trees), 289 verbose=self.verbose, class_weight=self.class_weight) --> 290 for i, t in enumerate(trees)) i = 4999 291 292 # Collect newly grown trees 293 self.estimators_.extend(trees) 294 ........................................................................... Please can you help me to identify a possible resolution to this. 
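To make the n_jobs point concrete: the traceback above shows the forest being built with backend="threading", so n_jobs=1000 spawns far more workers than there are cores and mainly adds overhead, while the memory is likely dominated by 5000 fully grown trees over 5 million rows. A hedged sketch of a leaner configuration (the parameter values are illustrative, not tuned; p_features_train and p_labels_train as in the original script):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,        # far fewer trees to start with
    n_jobs=-1,               # one worker per core instead of 1000
    max_depth=20,            # cap tree depth so each tree stays small
    min_samples_leaf=50,     # larger leaves shrink the trees on 5M rows
    verbose=1,
)
clf.fit(p_features_train, p_labels_train)

# if more trees turn out to help, grow them incrementally instead of refitting 5000 at once
clf.set_params(warm_start=True, n_estimators=400)
clf.fit(p_features_train, p_labels_train)    # adds 200 more trees to the existing forest

If memory is still tight, training on a stratified subsample of the 5 million rows is another easy lever to try.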
Thanks, Debu _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Fri Dec 9 04:56:53 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Fri, 9 Dec 2016 15:26:53 +0530 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Hi Piotr, Yes, I did use n_jobs = - 1 as well. But the code didn't run successfully. On my output screen , I got the following message instead of the JobLibMemoryError: 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1 at 176b071d 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0 events 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping the service 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final queue size is 0 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History Service in state History Service: STOPPED endpoint= http://servername.com:8188/ws/v1/timeline/ ; bonded to ATS=false; listening=true; batchSize=3; flush count=17; current queue size=0; total number queued=52, processed=50; post failures=0; 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped; ignoring queued event : [1481256746854]: SparkListenerApplicationEnd( 1481256746854) Just to get you a background I am executing the scikit-learn Random Classifier using pyspark command. I am not getting what has gone wrong while using n_jobs = -1 and suddenly the program is shutting down certain services. Please can you suggest a remedy as I have been given the task to run this via pyspark itself. Thanks in advance ! Cheers, Debu On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki wrote: > Hi Debu, > > it seems that you run out of memory. > Try using fewer processes. > I don't think that n_jobs = 1000 will perform as you wish. > > Setting n_jobs to -1 uses the number of cores in your system. > > > Greets, > Piotr > > > On 09.12.2016 08:16, Debabrata Ghosh wrote: > > Hi All, > > Greetings ! > > > > I am getting JoblibMemoryError while executing a scikit-learn > RandomForestClassifier code. Here is my algorithm in short: > > > > from sklearn.ensemble import RandomForestClassifier > > from sklearn.cross_validation import train_test_split > > import pandas as pd > > import numpy as np > > clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) > > clf.fit(p_input_features_train,p_input_labels_train) > > > The dataframe p_input_features contain 134 columns (features) and 5 > million rows (observations). 
The exact *error message* is given below: > > > Executing Random Forest Classifier > Traceback (most recent call last): > File "/home/user/rf_fold.py", line 43, in > clf.fit(p_features_train,p_labels_train) > File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", > line 290, in fit > for i, t in enumerate(trees)) > File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", > line 810, in __call__ > self.retrieve() > File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", > line 757, in retrieve > raise exception > sklearn.externals.joblib.my_exceptions.JoblibMemoryError: > JoblibMemoryError > ____________________________________________________________ > _______________ > Multiprocessing exception: > ............................................................ > ............... > > /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in > fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, > verbose=0, > warm_start=False), X=array([[ 0. , 0. , > 0. , .... 0. , 0. ]], dtype=float32), > y=array([[ 0.], > [ 0.], > [ 0.], > ..., > [ 0.], > [ 0.], > [ 0.]]), sample_weight=None) > 285 trees = Parallel(n_jobs=self.n_jobs, > verbose=self.verbose, > 286 backend="threading")( > 287 delayed(_parallel_build_trees)( > 288 t, self, X, y, sample_weight, i, len(trees), > 289 verbose=self.verbose, class_weight=self.class_ > weight) > --> 290 for i, t in enumerate(trees)) > i = 4999 > 291 > 292 # Collect newly grown trees > 293 self.estimators_.extend(trees) > 294 > > ............................................................ > ............... > > > > Please can you help me to identify a possible resolution to this. > > > Thanks, > > Debu > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Fri Dec 9 05:07:06 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Fri, 9 Dec 2016 10:07:06 +0000 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Hi Debu, I have not worked with pyspark yet and cannot resolve your error, but have you tried out sparkit-learn? https://github.com/lensacom/sparkit-learn It seems to be a package combining pyspark with sklearn and it also has a RandomForest and other classifiers: (SparkRandomForestClassifier, https://github.com/lensacom/sparkit-learn/blob/master/splearn/ensemble/__init__.py) Greets, Piotr On 09.12.2016 10:56, Debabrata Ghosh wrote: Hi Piotr, Yes, I did use n_jobs = - 1 as well. But the code didn't run successfully. 
On my output screen , I got the following message instead of the JobLibMemoryError: 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1 at 176b071d 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0 events 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping the service 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final queue size is 0 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History Service in state History Service: STOPPED endpoint=http://servername.com:8188/ws/v1/timeline/; bonded to ATS=false; listening=true; batchSize=3; flush count=17; current queue size=0; total number queued=52, processed=50; post failures=0; 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped; ignoring queued event : [1481256746854]: SparkListenerApplicationEnd(1481256746854) Just to get you a background I am executing the scikit-learn Random Classifier using pyspark command. I am not getting what has gone wrong while using n_jobs = -1 and suddenly the program is shutting down certain services. Please can you suggest a remedy as I have been given the task to run this via pyspark itself. Thanks in advance ! Cheers, Debu On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki > wrote: Hi Debu, it seems that you run out of memory. Try using fewer processes. I don't think that n_jobs = 1000 will perform as you wish. Setting n_jobs to -1 uses the number of cores in your system. Greets, Piotr On 09.12.2016 08:16, Debabrata Ghosh wrote: Hi All, Greetings ! I am getting JoblibMemoryError while executing a scikit-learn RandomForestClassifier code. Here is my algorithm in short: from sklearn.ensemble import RandomForestClassifier from sklearn.cross_validation import train_test_split import pandas as pd import numpy as np clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) clf.fit(p_input_features_train,p_input_labels_train) The dataframe p_input_features contain 134 columns (features) and 5 million rows (observations). The exact error message is given below: Executing Random Forest Classifier Traceback (most recent call last): File "/home/user/rf_fold.py", line 43, in clf.fit(p_features_train,p_labels_train) File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 290, in fit for i, t in enumerate(trees)) File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 810, in __call__ self.retrieve() File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757, in retrieve raise exception sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError ___________________________________________________________________________ Multiprocessing exception: ........................................................................... /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0, warm_start=False), X=array([[ 0. , 0. , 0. , .... 0. , 0. 
]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 0.], [ 0.], [ 0.]]), sample_weight=None) 285 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, 286 backend="threading")( 287 delayed(_parallel_build_trees)( 288 t, self, X, y, sample_weight, i, len(trees), 289 verbose=self.verbose, class_weight=self.class_weight) --> 290 for i, t in enumerate(trees)) i = 4999 291 292 # Collect newly grown trees 293 self.estimators_.extend(trees) 294 ........................................................................... Please can you help me to identify a possible resolution to this. Thanks, Debu _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Fri Dec 9 06:03:30 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Fri, 9 Dec 2016 16:33:30 +0530 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Thanks Piotr for your feedback ! I did look into the sparkit-learn yesterday but couldn't locate the fact that it contained RandomForestClassifier method in it. I would need to request customer for downloading this for me as I don't have permission for that. May I please get your possible help whether sparkit-learn will have the following methods (corresponding to skikit learn): 1.sklearn.ensemble -> RandomForestClassifier 2.sklearn.cross_validation -> StratifiedKFold 3.sklearn.cross_validation -> train_test_split Do we have a URl for sparkit-learn similar to skikit learn where all the methods are listed I have figured out that sparkit-learn needs to be downloaded from https://pypi.python.org/pypi/sparkit-learn but apart from it does anything else need to be downloaded. Just wanted to check once before requesting my customer as otherwise it would be a bit embarrassing. Thanks again ! Cheers, Debu On Fri, Dec 9, 2016 at 3:37 PM, Piotr Bialecki wrote: > Hi Debu, > > I have not worked with pyspark yet and cannot resolve your error, > but have you tried out sparkit-learn? > https://github.com/lensacom/sparkit-learn > > It seems to be a package combining pyspark with sklearn and it also has a > RandomForest and other classifiers: > (SparkRandomForestClassifier, https://github.com/lensacom/ > sparkit-learn/blob/master/splearn/ensemble/__init__.py) > > > Greets, > Piotr > > On 09.12.2016 10:56, Debabrata Ghosh wrote: > > Hi Piotr, > Yes, I did use n_jobs = - 1 as well. But the code > didn't run successfully. 
On my output screen , I got the following message > instead of the JobLibMemoryError: > > 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for > org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1 at 176b071d > 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0 > events > 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping > the service > 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final > queue size is 0 > 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History > Service in state History Service: STOPPED endpoint= > > http://servername.com:8188/ws/v1/timeline/ > ; bonded to > ATS=false; listening=true; batchSize=3; flush count=17; current queue > size=0; total number queued=52, processed=50; post failures=0; > 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook > 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped; > ignoring queued event : [1481256746854]: SparkListenerApplicationEnd(14 > 81256746854) > > Just to get you a background I am executing the > scikit-learn Random Classifier using pyspark command. I am not getting what > has gone wrong while using n_jobs = -1 and suddenly the program is shutting > down certain services. Please can you suggest a remedy as I have been given > the task to run this via pyspark itself. > > Thanks in advance ! > > Cheers, > > Debu > > On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki > wrote: > >> Hi Debu, >> >> it seems that you run out of memory. >> Try using fewer processes. >> I don't think that n_jobs = 1000 will perform as you wish. >> >> Setting n_jobs to -1 uses the number of cores in your system. >> >> >> Greets, >> Piotr >> >> >> On 09.12.2016 08:16, Debabrata Ghosh wrote: >> >> Hi All, >> >> Greetings ! >> >> >> >> I am getting JoblibMemoryError while executing a scikit-learn >> RandomForestClassifier code. Here is my algorithm in short: >> >> >> >> from sklearn.ensemble import RandomForestClassifier >> >> from sklearn.cross_validation import train_test_split >> >> import pandas as pd >> >> import numpy as np >> >> clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) >> >> clf.fit(p_input_features_train,p_input_labels_train) >> >> >> The dataframe p_input_features contain 134 columns (features) and 5 >> million rows (observations). The exact *error message* is given below: >> >> >> Executing Random Forest Classifier >> Traceback (most recent call last): >> File "/home/user/rf_fold.py", line 43, in >> clf.fit(p_features_train,p_labels_train) >> File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", >> line 290, in fit >> for i, t in enumerate(trees)) >> File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", >> line 810, in __call__ >> self.retrieve() >> File "/var/opt/lib /python2.7/site-packages/sklea >> rn/externals/joblib/parallel.py", line 757, in retrieve >> raise exception >> sklearn.externals.joblib.my_exceptions.JoblibMemoryError: >> JoblibMemoryError >> ____________________________________________________________ >> _______________ >> Multiprocessing exception: >> ............................................................ >> ............... >> >> /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in >> fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, >> verbose=0, >> warm_start=False), X=array([[ 0. , 0. , >> 0. , .... 0. , 0. 
]], dtype=float32),
>> y=array([[ 0.],
>>         [ 0.],
>>         [ 0.],
>>         ...,
>>         [ 0.],
>>         [ 0.],
>>         [ 0.]]), sample_weight=None)
>>     285         trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
>>     286                          backend="threading")(
>>     287             delayed(_parallel_build_trees)(
>>     288                 t, self, X, y, sample_weight, i, len(trees),
>>     289                 verbose=self.verbose, class_weight=self.class_weight)
>> --> 290             for i, t in enumerate(trees))
>>         i = 4999
>>     291
>>     292         # Collect newly grown trees
>>     293         self.estimators_.extend(trees)
>>     294
>> ...........................................................................
>>
>> Please can you help me to identify a possible resolution to this.
>>
>> Thanks,
>> Debu
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fabizs at yahoo.com Fri Dec 9 09:30:24 2016
From: fabizs at yahoo.com (Fabio Santos)
Date: Fri, 9 Dec 2016 14:30:24 +0000 (UTC)
Subject: [scikit-learn] Clustering information from one file
References: <1340756476.31830.1481293824712.ref@mail.yahoo.com>
Message-ID: <1340756476.31830.1481293824712@mail.yahoo.com>

Hi all,

My name is Fábio and I'm new to scikit-learn. I am trying to cluster information from one file with a Python script I found on the web, but I saw that the output has a problem with numbers... See:

Script:

import click
import re
import numpy
import random
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

@click.command()
@click.argument('filename')
@click.option('--clusters', default=50, help='Number of clusters')
@click.option('--sample', default=400, help='Number of samples to print')
def cluster_lines(filename, clusters, sample):
    lines = numpy.array(list(_get_lines(filename)))

    doc_feat = TfidfVectorizer().fit_transform(lines)
    km = KMeans(clusters).fit(doc_feat)

    k = 0
    clusters = defaultdict(list)
    for i in km.labels_:
        clusters[i].append(lines[k])
        k += 1

    s_clusters = sorted(clusters.values(), key=lambda l: -len(l))

    for cluster in s_clusters:
        print 'Cluster [%s]:' % len(cluster)
        if len(cluster) > sample:
            cluster = random.sample(cluster, sample)
        for line in cluster:
            print line
        print '--------'

def _clean_line(line):
    line = line.strip().lower()
    line = re.sub('\d+', '(N)', line)
    return line

def _get_lines(filename):
    for line in open(filename).readlines():
        yield _clean_line(line)

if __name__ == '__main__':
    cluster_lines()

Output:

[root at vmcaiosyscolprod01 71001492]# python Cluster-LearnMachine.py DataSets/ospf.teste3
Cluster [7]:
"rjbotaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
"rjmteab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
"rjmckaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
"rjdqcaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
"rjdqcab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
"rjcenaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
"rjcenab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
--------
Cluster [1]:
"rjbotab max-metric router-lsa on-startup log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
--------
Cluster [1]:
"rjmteaa ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
--------
Cluster [1]:
"rjmckab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
--------

Note that the output shows (N) in place of the numbers, and I have not found a way to use the biggest cluster as a template to find the differences between the biggest cluster and the other clusters. How can I do that?

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
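Two notes on the script above, added here for reference. The (N) tokens come from its _clean_line helper, which replaces every run of digits with the literal string '(N)' before clustering. And one way (not from the thread) to use the biggest cluster as a template is to diff every line against a representative of that cluster with the standard-library difflib module, as in the sketch below; the toy call at the end stands in for passing the script's lines and km.labels_.

import difflib
from collections import defaultdict

def diff_against_largest(lines, labels):
    """Group lines by cluster label and print, for each line, the tokens
    that differ from a template taken from the biggest cluster."""
    clusters = defaultdict(list)
    for label, line in zip(labels, lines):
        clusters[label].append(line)
    # use the first member of the biggest cluster as the template
    template = max(clusters.values(), key=len)[0]
    for label, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
        for line in members:
            # keep only tokens that are missing ('-') or extra ('+') vs the template
            delta = [t for t in difflib.ndiff(template.split(), line.split())
                     if t.startswith(('-', '+'))]
            if delta:
                print('cluster %s: %s' % (label, ' '.join(delta)))

# toy usage; with the script above you would pass `lines` and `km.labels_`
diff_against_largest(['a b c', 'a b c', 'a b x'], [0, 0, 1])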
URL: From mailfordebu at gmail.com Fri Dec 9 13:00:37 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Fri, 9 Dec 2016 23:30:37 +0530 Subject: [scikit-learn] ensemble method within splearn Message-ID: Hi, I have downloaded sparkit-learn from https://pypi.python.org/pypi/sparkit-learn but it doesn't have the ensemble method. Please can you suggest the solution for the same. It's urgent please. Thanks, Debu -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Fri Dec 9 14:44:29 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Fri, 9 Dec 2016 11:44:29 -0800 Subject: [scikit-learn] ensemble method within splearn In-Reply-To: References: Message-ID: Hello, This mailing list is dedicated to scikit-learn. For sparkit-learn information, I suggest you contact directly the developers directly, maybe by opening a ticket on their github project page: https://github.com/lensacom/sparkit-learn Thanks, Nelle On 9 December 2016 at 10:00, Debabrata Ghosh wrote: > Hi, > I have downloaded sparkit-learn from > https://pypi.python.org/pypi/sparkit-learn but it doesn't have the ensemble > method. > > Please can you suggest the solution for the same. It's urgent > please. > > Thanks, > > Debu > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From mamunbabu2001 at gmail.com Mon Dec 12 07:56:15 2016 From: mamunbabu2001 at gmail.com (Mamun Rashid) Date: Mon, 12 Dec 2016 12:56:15 +0000 Subject: [scikit-learn] SMOTE-ENN in Imbalanced-learn package Message-ID: Hi All, Not sure if questions regarding the contributory packages are answered here. Just trying my luck. I am have a seriously imbalanced classification problem. I am trying to use SMOTE+ENN oversampling and undersampling method to oversample my minority class and oversample my majority class. ======== from sklearn.datasets import make_classification from imblearn.combine import SMOTEENN sm = SMOTEENN() X, y = make_classification(n_classes=2, class_sep=2, weights=[0.2, 0.8], n_informative=1, n_redundant=1, flip_y=0, n_features=3, n_clusters_per_class=1, n_samples=50, random_state=10) X_df = pd.DataFrame(X) X_resampled, y_resampled = sm.fit_sample(X_df, y) ========= I understand that SMOTE returns a resampled data matrix i.e. X_resampled. I was wondering if there is a direct way to retrieve the indexes of the original data observations ? Thanks in advance. Best Regards and Seasons Greetings., Mamun -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Dec 12 08:05:31 2016 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Mon, 12 Dec 2016 14:05:31 +0100 Subject: [scikit-learn] SMOTE-ENN in Imbalanced-learn package In-Reply-To: References: Message-ID: Hi Mamun, The new samples generated through SMOTE are synthetically created. You can refer to the paper of Chawla for more information. Therefore, there is no indexes linked to the original data. However, while under-sampling you can get this information setting up the `return_indices=True`. Cheers, On 12 December 2016 at 13:56, Mamun Rashid wrote: > Hi All, > Not sure if questions regarding the contributory packages are answered > here. Just trying my luck. > > I am have a seriously imbalanced classification problem. 
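A minimal sketch of the work-around Guillaume describes above: run the ENN under-sampling step on its own with return_indices=True so the surviving rows can be traced back to the original dataframe. It assumes an imbalanced-learn release from around the time of this thread, where the under-samplers still accept return_indices and expose fit_sample; the make_classification arguments are copied from Mamun's example.

import pandas as pd
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.2, 0.8],
                           n_informative=1, n_redundant=1, flip_y=0,
                           n_features=3, n_clusters_per_class=1,
                           n_samples=50, random_state=10)

# return_indices=True (older imbalanced-learn API) also returns the kept row indices
enn = EditedNearestNeighbours(return_indices=True)
X_res, y_res, kept_idx = enn.fit_sample(X, y)   # kept_idx indexes rows of X
X_df_res = pd.DataFrame(X).iloc[kept_idx]       # recover the original observations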
I am trying to > use SMOTE+ENN oversampling and undersampling method to oversample my > minority class and oversample my majority class. > > ======== > > from sklearn.datasets import make_classification > from imblearn.combine import SMOTEENN > > sm = SMOTEENN() > X, y = make_classification(n_classes=2, class_sep=2, weights=[0.2, 0.8], > n_informative=1, n_redundant=1, flip_y=0, n_features=3, > n_clusters_per_class=1, n_samples=50, random_state=10) > X_df = pd.DataFrame(X) > X_resampled, y_resampled = sm.fit_sample(X_df, y) > > ========= > > I understand that SMOTE returns a resampled data matrix i.e. X_resampled. I > was wondering if there is a direct way to retrieve the indexes of the > original data observations ? > > Thanks in advance. > > Best Regards and Seasons Greetings., > Mamun > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Dec 13 10:29:24 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 13 Dec 2016 10:29:24 -0500 Subject: [scikit-learn] [Scikit-learn-general] Getting involved(beginner) In-Reply-To: References: Message-ID: <4be9e55d-53ec-b256-2030-e0423f6edb60@gmail.com> Hi. Sorry, IRC is pretty dead. You can try gitter, though it's also not super busy. Instruction to contributing are here: http://scikit-learn.org/dev/developers/contributing.html Andy On 12/13/2016 05:57 AM, piyush goel wrote: > Dear developer, > > I am Piyush an Electrical and Electronics Engineering student from > DTU,India. > I'm new to open source and the Python foundation as well, but I want > to get involved. I tried contacting the org through irc but wasn't > able to talk to anyone, so I'm writing this mail. > Please guide me through the steps of contributing to sci kit. I don't > have much experience in machine learning but I wish to learn while > contributing to the project, please tell me if it's a good idea. If > possible can you assign me some simple issues. > > Thanks. > > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, SlashDot.org! http://sdm.link/slashdot > > > _______________________________________________ > Scikit-learn-general mailing list > Scikit-learn-general at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From fla168 at 163.com Tue Dec 13 12:48:37 2016 From: fla168 at 163.com (fulean) Date: Wed, 14 Dec 2016 01:48:37 +0800 (CST) Subject: [scikit-learn] svm low-api gives bad prediction results Message-ID: <1d02610e.1.158f94ce94c.Coremail.fla168@163.com> hi all, I want to export the svm parameters and apply it in a c++ svm implementation at https://github.com/yctung/AndroidLibSvm. 
after grid search, SVC with C=1.0, gamma=10.0 gets 92% accuracy. Unfortunately the SVC model doesn't contain the parameters needed by the C++ model, which takes as input the low-level svm params (sv_coef, probA, ...). The return values of the low-level svm API, 'libsvm.fit', match that requirement, but the prediction result is different from the SVC model.

The code:

model = libsvm.fit(X_data.astype(np.float64), Y_data.astype(np.float64),
                   svm_type=0, kernel='rbf', C=1.0, gamma=10.0)
pred = libsvm.predict(X_data.astype(np.float64), *model, kernel='rbf')
print "hello mean " + str(np.mean(pred == Y_data))
# result: "hello mean 0.570588235294"

Can somebody give some suggestions on this problem? Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stuart at stuartreynolds.net Tue Dec 13 13:27:05 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Tue, 13 Dec 2016 10:27:05 -0800
Subject: [scikit-learn] Model checksums
Message-ID: 

I'd like to cache some functions to avoid rebuilding models like so:

  @cached
  def train(model, dataparams): ...

model is an (untrained) scikit-learn object and dataparams is a dict.
The @cached annotation forms a SHA checksum out of the parameters of the
function it annotates and returns the previously calculated function result
if the parameters match.

The tricky part here is reliably generating a checksum from the parameters.
Scikit uses Python's pickle
(http://scikit-learn.org/stable/modules/model_persistence.html) but the
pickle library is non-deterministic (same inputs to pickle.dumps yields
differing output! -- *I know*).

So... any suggestions on how to generate checksums from models in python?

Thanks.
- Stuart
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com Tue Dec 13 13:34:52 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Tue, 13 Dec 2016 19:34:52 +0100
Subject: [scikit-learn] Model checksums
In-Reply-To: 
References: 
Message-ID: <20161213183452.4894801.68992.23630@gmail.com>

An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Tue Dec 13 15:10:47 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 13 Dec 2016 21:10:47 +0100
Subject: [scikit-learn] Model checksums
In-Reply-To: 
References: 
Message-ID: 

What do you mean by non-deterministic? If you set the random_state of models, we try to make them deterministic. Most often, any residual variability is numerical noise that reveals statistical error bars.

G

Sent from my phone. Please forgive brevity and misspelling.

On Dec 13, 2016, 19:29, at 19:29, Stuart Reynolds wrote:
>I'd like to cache some functions to avoid rebuilding models like so:
>
> @cached
> def train(model, dataparams): ...
>
>model is an (untrained) scikit-learn object and dataparams is a dict.
>The @cached annotation forms a SHA checksum out of the parameters of the
>function it annotates and returns the previously calculated function
>result if the parameters match.
>
>The tricky part here is reliably generating a checksum from the
>parameters.
>Scikit uses Python's pickle (
>http://scikit-learn.org/stable/modules/model_persistence.html) but the
>pickle library is non-deterministic (same inputs to pickle.dumps yields
>differing output! -- *I know*).
>
>So... any suggestions on how to generate checksums from models in
>python?
>
>Thanks.
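A minimal sketch (not from the thread) of one way to build the reproducible checksum asked about above: hash the estimator class, its get_params() output and the data parameters with joblib.hash, the helper joblib's own Memory cache uses to fingerprint call arguments, instead of hashing a raw pickle.dumps string.

from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib   # 2016-era import; newer code would use `import joblib`

def cache_key(model, dataparams):
    """Build a stable checksum from the estimator class, its
    hyper-parameters and a dict of data parameters."""
    return joblib.hash((type(model).__name__, model.get_params(), dataparams))

# hypothetical usage with made-up data parameters
key = cache_key(RandomForestClassifier(n_estimators=100, random_state=0),
                {"dataset": "train", "n_rows": 5000})
print(key)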
>- Stuart > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From graham.arthur.mackenzie at gmail.com Tue Dec 13 15:14:43 2016 From: graham.arthur.mackenzie at gmail.com (Graham Arthur Mackenzie) Date: Tue, 13 Dec 2016 12:14:43 -0800 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? Message-ID: Hello All, I hope this is the right way to ask a question about documentation. In the doc for Decision Trees , the fit statement is assigned back to the classifier: clf = clf.fit(X, Y) Whereas, for Naive Bayes and Support Vector Machines , it's just: clf.fit(X, Y) I assumed this was a typo, but thought I should try and verify such before proceeding under that assumption. I appreciate any feedback you can provide. Thank You and Be Well, Graham -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Tue Dec 13 15:23:00 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Tue, 13 Dec 2016 12:23:00 -0800 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: References: Message-ID: The fit method returns the object itself, so regardless of which way you do it, it will work. The reason the fit method returns itself is so that you can chain methods, like "preds = clf.fit(X, y).predict(X)" On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < graham.arthur.mackenzie at gmail.com> wrote: > Hello All, > > I hope this is the right way to ask a question about documentation. > > In the doc for Decision Trees > , the fit > statement is assigned back to the classifier: > > clf = clf.fit(X, Y) > > Whereas, for Naive Bayes > > and Support Vector Machines > , > it's just: > > clf.fit(X, Y) > > I assumed this was a typo, but thought I should try and verify such before > proceeding under that assumption. I appreciate any feedback you can provide. > > Thank You and Be Well, > Graham > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Tue Dec 13 15:33:48 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 13 Dec 2016 12:33:48 -0800 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: References: Message-ID: I think he's asking whether returning the model is part of the API (i.e. is it a bug that SVM and NB don't return self?). On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber wrote: > The fit method returns the object itself, so regardless of which way you > do it, it will work. The reason the fit method returns itself is so that > you can chain methods, like "preds = clf.fit(X, y).predict(X)" > > On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < > graham.arthur.mackenzie at gmail.com> wrote: > >> Hello All, >> >> I hope this is the right way to ask a question about documentation. 
>> >> In the doc for Decision Trees >> , the fit >> statement is assigned back to the classifier: >> >> clf = clf.fit(X, Y) >> >> Whereas, for Naive Bayes >> >> and Support Vector Machines >> , >> it's just: >> >> clf.fit(X, Y) >> >> I assumed this was a typo, but thought I should try and verify such >> before proceeding under that assumption. I appreciate any feedback you can >> provide. >> >> Thank You and Be Well, >> Graham >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Tue Dec 13 15:38:42 2016 From: zephyr14 at gmail.com (Vlad Niculae) Date: Tue, 13 Dec 2016 15:38:42 -0500 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: References: Message-ID: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> It is part of the API and enforced with tests, if I'm not mistaken. So you could use either form with all sklearn estimators. Vlad On December 13, 2016 3:33:48 PM EST, Stuart Reynolds wrote: >I think he's asking whether returning the model is part of the API >(i.e. is >it a bug that SVM and NB don't return self?). > >On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber > >wrote: > >> The fit method returns the object itself, so regardless of which way >you >> do it, it will work. The reason the fit method returns itself is so >that >> you can chain methods, like "preds = clf.fit(X, y).predict(X)" >> >> On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < >> graham.arthur.mackenzie at gmail.com> wrote: >> >>> Hello All, >>> >>> I hope this is the right way to ask a question about documentation. >>> >>> In the doc for Decision Trees >>> , the fit >>> statement is assigned back to the classifier: >>> >>> clf = clf.fit(X, Y) >>> >>> Whereas, for Naive Bayes >>> > >>> and Support Vector Machines >>> >, >>> it's just: >>> >>> clf.fit(X, Y) >>> >>> I assumed this was a typo, but thought I should try and verify such >>> before proceeding under that assumption. I appreciate any feedback >you can >>> provide. >>> >>> Thank You and Be Well, >>> Graham >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Dec 13 15:45:02 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 13 Dec 2016 15:45:02 -0500 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? 
In-Reply-To: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> References: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> Message-ID: <084b7a4e-1361-dcc7-04a9-c67e1b3ca611@gmail.com> On 12/13/2016 03:38 PM, Vlad Niculae wrote: > It is part of the API and enforced with tests, if I'm not mistaken. So > you could use either form with all sklearn estimators. It is indeed enforced. Though I feel clf = clf.fit(X, y) is somewhat ugly and I would rather not have it in the docs. Alsok this example uses a capital Y,so two reasons to change it ;) From zephyr14 at gmail.com Tue Dec 13 16:11:01 2016 From: zephyr14 at gmail.com (Vlad Niculae) Date: Tue, 13 Dec 2016 16:11:01 -0500 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: <084b7a4e-1361-dcc7-04a9-c67e1b3ca611@gmail.com> References: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> <084b7a4e-1361-dcc7-04a9-c67e1b3ca611@gmail.com> Message-ID: I agree; if you're not actually doing daisy-chaining, the stateful and more concise form `clf.fit(X.y)` looks more pythonic in my opinion. Also it seems that the "fit returns self" convention is not documented here [1], maybe we should briefly mention it? http://scikit-learn.org/stable/tutorial/basic/tutorial.html On Tue, Dec 13, 2016 at 3:45 PM, Andreas Mueller wrote: > > > On 12/13/2016 03:38 PM, Vlad Niculae wrote: >> >> It is part of the API and enforced with tests, if I'm not mistaken. So you >> could use either form with all sklearn estimators. > > > It is indeed enforced. > Though I feel clf = clf.fit(X, y) > is somewhat ugly and I would rather not have it in the docs. > Alsok this example uses a capital Y,so two reasons to change it ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From graham.arthur.mackenzie at gmail.com Tue Dec 13 17:02:29 2016 From: graham.arthur.mackenzie at gmail.com (Graham Arthur Mackenzie) Date: Tue, 13 Dec 2016 14:02:29 -0800 Subject: [scikit-learn] scikit-learn Digest, Vol 9, Issue 42 In-Reply-To: References: Message-ID: Thanks for the speedy and helpful responses! Actually, the thrust of my question was, "I'm assuming the fit() method for all three modules work the same way, so how come the example code for DTs differs from NB, SVMs?" Since you seem to be saying that it'll work either way, I'm assuming there's no real reason behind it, which was my suspicion, but just wanted to have it confirmed, as the inconsistency was conspicuous. Thanks! GAM ps, My apologies if this is the improper way to respond to responses. I am receiving the Digest rather than individual messages, so this was the best I could think to do... On Tue, Dec 13, 2016 at 12:38 PM, wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Why do DTs have a different fit protocol than NB and SVMs? > (Graham Arthur Mackenzie) > 2. Re: Why do DTs have a different fit protocol than NB and > SVMs? (Jacob Schreiber) > 3. 
Re: Why do DTs have a different fit protocol than NB and > SVMs? (Stuart Reynolds) > 4. Re: Why do DTs have a different fit protocol than NB and > SVMs? (Vlad Niculae) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 13 Dec 2016 12:14:43 -0800 > From: Graham Arthur Mackenzie > To: scikit-learn at python.org > Subject: [scikit-learn] Why do DTs have a different fit protocol than > NB and SVMs? > Message-ID: > ail.com> > Content-Type: text/plain; charset="utf-8" > > Hello All, > > I hope this is the right way to ask a question about documentation. > > In the doc for Decision Trees > , the fit statement > is assigned back to the classifier: > > clf = clf.fit(X, Y) > > Whereas, for Naive Bayes > naive_bayes.GaussianNB.html> > and Support Vector Machines > , > it's just: > > clf.fit(X, Y) > > I assumed this was a typo, but thought I should try and verify such before > proceeding under that assumption. I appreciate any feedback you can > provide. > > Thank You and Be Well, > Graham > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20161213/8bbeacdb/attachment-0001.html> > > ------------------------------ > > Message: 2 > Date: Tue, 13 Dec 2016 12:23:00 -0800 > From: Jacob Schreiber > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Why do DTs have a different fit protocol > than NB and SVMs? > Message-ID: > ail.com> > Content-Type: text/plain; charset="utf-8" > > The fit method returns the object itself, so regardless of which way you do > it, it will work. The reason the fit method returns itself is so that you > can chain methods, like "preds = clf.fit(X, y).predict(X)" > > On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < > graham.arthur.mackenzie at gmail.com> wrote: > > > Hello All, > > > > I hope this is the right way to ask a question about documentation. > > > > In the doc for Decision Trees > > , the fit > > statement is assigned back to the classifier: > > > > clf = clf.fit(X, Y) > > > > Whereas, for Naive Bayes > > naive_bayes.GaussianNB.html> > > and Support Vector Machines > > , > > it's just: > > > > clf.fit(X, Y) > > > > I assumed this was a typo, but thought I should try and verify such > before > > proceeding under that assumption. I appreciate any feedback you can > provide. > > > > Thank You and Be Well, > > Graham > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20161213/08e2e7c2/attachment-0001.html> > > ------------------------------ > > Message: 3 > Date: Tue, 13 Dec 2016 12:33:48 -0800 > From: Stuart Reynolds > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Why do DTs have a different fit protocol > than NB and SVMs? > Message-ID: > gmail.com> > Content-Type: text/plain; charset="utf-8" > > I think he's asking whether returning the model is part of the API (i.e. is > it a bug that SVM and NB don't return self?). > > On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber > > wrote: > > > The fit method returns the object itself, so regardless of which way you > > do it, it will work. 
The reason the fit method returns itself is so that > > you can chain methods, like "preds = clf.fit(X, y).predict(X)" > > > > On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < > > graham.arthur.mackenzie at gmail.com> wrote: > > > >> Hello All, > >> > >> I hope this is the right way to ask a question about documentation. > >> > >> In the doc for Decision Trees > >> , the fit > >> statement is assigned back to the classifier: > >> > >> clf = clf.fit(X, Y) > >> > >> Whereas, for Naive Bayes > >> naive_bayes.GaussianNB.html> > >> and Support Vector Machines > >> >, > >> it's just: > >> > >> clf.fit(X, Y) > >> > >> I assumed this was a typo, but thought I should try and verify such > >> before proceeding under that assumption. I appreciate any feedback you > can > >> provide. > >> > >> Thank You and Be Well, > >> Graham > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20161213/2d6dd8e7/attachment-0001.html> > > ------------------------------ > > Message: 4 > Date: Tue, 13 Dec 2016 15:38:42 -0500 > From: Vlad Niculae > To: Scikit-learn user and developer mailing list > , Stuart Reynolds < > stuart at stuartreynolds.net> > Subject: Re: [scikit-learn] Why do DTs have a different fit protocol > than NB and SVMs? > Message-ID: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0 at gmail.com> > Content-Type: text/plain; charset="utf-8" > > It is part of the API and enforced with tests, if I'm not mistaken. So you > could use either form with all sklearn estimators. > > Vlad > > On December 13, 2016 3:33:48 PM EST, Stuart Reynolds < > stuart at stuartreynolds.net> wrote: > >I think he's asking whether returning the model is part of the API > >(i.e. is > >it a bug that SVM and NB don't return self?). > > > >On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber > > > >wrote: > > > >> The fit method returns the object itself, so regardless of which way > >you > >> do it, it will work. The reason the fit method returns itself is so > >that > >> you can chain methods, like "preds = clf.fit(X, y).predict(X)" > >> > >> On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < > >> graham.arthur.mackenzie at gmail.com> wrote: > >> > >>> Hello All, > >>> > >>> I hope this is the right way to ask a question about documentation. > >>> > >>> In the doc for Decision Trees > >>> , the fit > >>> statement is assigned back to the classifier: > >>> > >>> clf = clf.fit(X, Y) > >>> > >>> Whereas, for Naive Bayes > >>> > > naive_bayes.GaussianNB.html> > >>> and Support Vector Machines > >>> > >, > >>> it's just: > >>> > >>> clf.fit(X, Y) > >>> > >>> I assumed this was a typo, but thought I should try and verify such > >>> before proceeding under that assumption. I appreciate any feedback > >you can > >>> provide. 
> >>> > >>> Thank You and Be Well, > >>> Graham > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > > > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >scikit-learn mailing list > >scikit-learn at python.org > >https://mail.python.org/mailman/listinfo/scikit-learn > > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20161213/45217ea3/attachment.html> > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 9, Issue 42 > ******************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Wed Dec 14 03:13:29 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Wed, 14 Dec 2016 13:43:29 +0530 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib Message-ID: Hi All, I have run scikit-learn Random Forest Classifier algorithm against a dataset and here is my TPR and FPR against various thresholds: [image: Inline image 1] Further I have plotted the above values in matplotlib and am getting a very low AUC. Here is my matplotlib code. Can I understand the interpretation of the graph from you please.Is my model Ok or is there something wrong ? Appreciate for a quick response please. import matplotlib.pyplot as plt import numpy as np from sklearn import metrics plt.title('Receiver Operating Characteristic') plt.ylabel('True Positive Rate') plt.xlabel('False Positive Rate') fpr = [0.0002337345394340,0.0001924870472260,0.0001626973851550,0.0000950977673794, 0.0000721826427097,0.0000538505429739,0.0000389557119386,0.0000263523933702, 0.0000137490748018] tpr = [0.19673638244100000000,0.18984141576600000000,0.18122270742400000000, 0.17055510860800000000,0.16434892541100000000,0.15789473684200000000, 0.15134451850100000000,0.14410480349300000000,0.13238336014700000000] roc_auc = metrics.auc(fpr, tpr) plt.plot([0, 1], [0, 1],'r--') plt.plot(fpr, tpr, 'bo-', label = 'AUC = %0.9f' % roc_auc) plt.legend(loc = 'lower right') plt.show() [image: Inline image 2] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 18986 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 8894 bytes Desc: not available URL: From Dale.T.Smith at macys.com Wed Dec 14 08:10:51 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 14 Dec 2016 13:10:51 +0000 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib In-Reply-To: References: Message-ID: I think you need to look at the examples. 
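For reference, a minimal runnable sketch (with synthetic data, not Debu's actual pipeline) of the approach the replies in this thread describe: score the held-out set with predict_proba and let roc_curve sweep every threshold instead of a hand-picked few, then plot TPR against FPR over the full [0, 1] range.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data standing in for the real dataset
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)  # every threshold, not a subset
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label='AUC = %0.3f' % auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()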
__________________________________________________________________________________________________________________________________________ Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Debabrata Ghosh Sent: Wednesday, December 14, 2016 3:13 AM To: Scikit-learn user and developer mailing list Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib ? EXT MSG: Hi All, I have run scikit-learn Random Forest Classifier algorithm against a dataset and here is my TPR and FPR against various thresholds: [Inline image 1] Further I have plotted the above values in matplotlib and am getting a very low AUC. Here is my matplotlib code. Can I understand the interpretation of the graph from you please.Is my model Ok or is there something wrong ? Appreciate for a quick response please. import matplotlib.pyplot as plt import numpy as np from sklearn import metrics plt.title('Receiver Operating Characteristic') plt.ylabel('True Positive Rate') plt.xlabel('False Positive Rate') fpr = [0.0002337345394340,0.0001924870472260,0.0001626973851550,0.0000950977673794, 0.0000721826427097,0.0000538505429739,0.0000389557119386,0.0000263523933702, 0.0000137490748018] tpr = [0.19673638244100000000,0.18984141576600000000,0.18122270742400000000, 0.17055510860800000000,0.16434892541100000000,0.15789473684200000000, 0.15134451850100000000,0.14410480349300000000,0.13238336014700000000] roc_auc = metrics.auc(fpr, tpr) plt.plot([0, 1], [0, 1],'r--') plt.plot(fpr, tpr, 'bo-', label = 'AUC = %0.9f' % roc_auc) plt.legend(loc = 'lower right') plt.show() [Inline image 2] * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8894 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 13149 bytes Desc: image002.jpg URL: From Dale.T.Smith at macys.com Wed Dec 14 08:08:52 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 14 Dec 2016 13:08:52 +0000 Subject: [scikit-learn] Renaming subject lines if you get a digest Message-ID: Please rename subjects if you use the digest ? now the thread is not complete in the archive. Others will have a harder time benefitting from answers. __________________________________________________________________________________________________________________________________________ Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Graham Arthur Mackenzie Sent: Tuesday, December 13, 2016 5:02 PM To: scikit-learn at python.org Subject: Re: [scikit-learn] scikit-learn Digest, Vol 9, Issue 42 ? EXT MSG: Thanks for the speedy and helpful responses! Actually, the thrust of my question was, "I'm assuming the fit() method for all three modules work the same way, so how come the example code for DTs differs from NB, SVMs?" 
Since you seem to be saying that it'll work either way, I'm assuming there's no real reason behind it, which was my suspicion, but just wanted to have it confirmed, as the inconsistency was conspicuous. Thanks! GAM ps, My apologies if this is the improper way to respond to responses. I am receiving the Digest rather than individual messages, so this was the best I could think to do... On Tue, Dec 13, 2016 at 12:38 PM, > wrote: Send scikit-learn mailing list submissions to scikit-learn at python.org To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn or, via email, send a message with subject or body 'help' to scikit-learn-request at python.org You can reach the person managing the list at scikit-learn-owner at python.org When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..." Today's Topics: 1. Why do DTs have a different fit protocol than NB and SVMs? (Graham Arthur Mackenzie) 2. Re: Why do DTs have a different fit protocol than NB and SVMs? (Jacob Schreiber) 3. Re: Why do DTs have a different fit protocol than NB and SVMs? (Stuart Reynolds) 4. Re: Why do DTs have a different fit protocol than NB and SVMs? (Vlad Niculae) ---------------------------------------------------------------------- Message: 1 Date: Tue, 13 Dec 2016 12:14:43 -0800 From: Graham Arthur Mackenzie > To: scikit-learn at python.org Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? Message-ID: > Content-Type: text/plain; charset="utf-8" Hello All, I hope this is the right way to ask a question about documentation. In the doc for Decision Trees , the fit statement is assigned back to the classifier: clf = clf.fit(X, Y) Whereas, for Naive Bayes and Support Vector Machines , it's just: clf.fit(X, Y) I assumed this was a typo, but thought I should try and verify such before proceeding under that assumption. I appreciate any feedback you can provide. Thank You and Be Well, Graham -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Tue, 13 Dec 2016 12:23:00 -0800 From: Jacob Schreiber > To: Scikit-learn user and developer mailing list > Subject: Re: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? Message-ID: > Content-Type: text/plain; charset="utf-8" The fit method returns the object itself, so regardless of which way you do it, it will work. The reason the fit method returns itself is so that you can chain methods, like "preds = clf.fit(X, y).predict(X)" On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < graham.arthur.mackenzie at gmail.com> wrote: > Hello All, > > I hope this is the right way to ask a question about documentation. > > In the doc for Decision Trees > , the fit > statement is assigned back to the classifier: > > clf = clf.fit(X, Y) > > Whereas, for Naive Bayes > > and Support Vector Machines > , > it's just: > > clf.fit(X, Y) > > I assumed this was a typo, but thought I should try and verify such before > proceeding under that assumption. I appreciate any feedback you can provide. > > Thank You and Be Well, > Graham > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 3 Date: Tue, 13 Dec 2016 12:33:48 -0800 From: Stuart Reynolds > To: Scikit-learn user and developer mailing list > Subject: Re: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? Message-ID: > Content-Type: text/plain; charset="utf-8" I think he's asking whether returning the model is part of the API (i.e. is it a bug that SVM and NB don't return self?). On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber > wrote: > The fit method returns the object itself, so regardless of which way you > do it, it will work. The reason the fit method returns itself is so that > you can chain methods, like "preds = clf.fit(X, y).predict(X)" > > On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < > graham.arthur.mackenzie at gmail.com> wrote: > >> Hello All, >> >> I hope this is the right way to ask a question about documentation. >> >> In the doc for Decision Trees >> , the fit >> statement is assigned back to the classifier: >> >> clf = clf.fit(X, Y) >> >> Whereas, for Naive Bayes >> >> and Support Vector Machines >> , >> it's just: >> >> clf.fit(X, Y) >> >> I assumed this was a typo, but thought I should try and verify such >> before proceeding under that assumption. I appreciate any feedback you can >> provide. >> >> Thank You and Be Well, >> Graham >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 4 Date: Tue, 13 Dec 2016 15:38:42 -0500 From: Vlad Niculae > To: Scikit-learn user and developer mailing list >, Stuart Reynolds > Subject: Re: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? Message-ID: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0 at gmail.com> Content-Type: text/plain; charset="utf-8" It is part of the API and enforced with tests, if I'm not mistaken. So you could use either form with all sklearn estimators. Vlad On December 13, 2016 3:33:48 PM EST, Stuart Reynolds > wrote: >I think he's asking whether returning the model is part of the API >(i.e. is >it a bug that SVM and NB don't return self?). > >On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber >> >wrote: > >> The fit method returns the object itself, so regardless of which way >you >> do it, it will work. The reason the fit method returns itself is so >that >> you can chain methods, like "preds = clf.fit(X, y).predict(X)" >> >> On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < >> graham.arthur.mackenzie at gmail.com> wrote: >> >>> Hello All, >>> >>> I hope this is the right way to ask a question about documentation. >>> >>> In the doc for Decision Trees >>> , the fit >>> statement is assigned back to the classifier: >>> >>> clf = clf.fit(X, Y) >>> >>> Whereas, for Naive Bayes >>> > >>> and Support Vector Machines >>> >, >>> it's just: >>> >>> clf.fit(X, Y) >>> >>> I assumed this was a typo, but thought I should try and verify such >>> before proceeding under that assumption. I appreciate any feedback >you can >>> provide. 
>>> >>> Thank You and Be Well, >>> Graham >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Subject: Digest Footer _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn ------------------------------ End of scikit-learn Digest, Vol 9, Issue 42 ******************************************* * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Wed Dec 14 11:52:22 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Wed, 14 Dec 2016 16:52:22 +0000 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib In-Reply-To: References: Message-ID: You're looking at a tiny subset of the possible cutoff thresholds for this classifier. Lower thresholds will give higher tot at the expense of tpr. Usually, AUC is computed at the integral of this graph over the whole range of FPRs (from zero to one). If you have your classifier output probabilities or activations, the maximum and minimum of these values will tell you what the largest and smallest thresholds should be. Scikit also has a function to directly receive the activations and true classes and compute the AUC and tpr/fpr curve. On Wed, Dec 14, 2016 at 5:12 AM Dale T Smith wrote: > > > > > > > > > > > > > > > > > I think you need to look at the examples. > > > > > > > > > > __________________________________________________________________________________________________________________________________________ > > > *Dale T. Smith* > > *|* Macy's Systems and Technology > > *|* IFS eCom CSE Data Science > > > > > 5985 State Bridge Road, Johns Creek, GA 30097 *|* dale.t.smith at macys.com > > > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] > > *On Behalf Of *Debabrata Ghosh > > > *Sent:* Wednesday, December 14, 2016 3:13 AM > > > *To:* Scikit-learn user and developer mailing list > > > *Subject:* [scikit-learn] Scikit Learn Random Classifier - TPR and FPR > plotted on matplotlib > > > > > > > > ? EXT MSG: > > > > > > > > > > > Hi All, > > > > > I have run scikit-learn Random Forest Classifier > algorithm against a dataset and here is my TPR and FPR against various > thresholds: > > > > > > [image: Inline image 1] > > > > > Further I have plotted the above values in matplotlib and am getting a > very low AUC. Here is my matplotlib code. Can I understand the > interpretation of the graph from you please.Is my model Ok or is there > something wrong ? Appreciate for > > a quick response please. 
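A self-contained sketch of the approach described above, sweeping every threshold with roc_curve instead of a handful of hand-picked cutoffs (the synthetic data and the RandomForestClassifier below are stand-ins, not the original poster's dataset or model):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data; the real features and labels would go here.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_score = clf.predict_proba(X_te)[:, 1]          # positive-class probabilities

fpr, tpr, thresholds = roc_curve(y_te, y_score)  # all thresholds, not a subset
auc = roc_auc_score(y_te, y_score)

plt.title('Receiver Operating Characteristic')
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(fpr, tpr, label='AUC = %0.3f' % auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()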
> > > > > > import matplotlib.pyplot as plt > > > import numpy as np > > > from sklearn import metrics > > > plt.title('Receiver Operating Characteristic') > > > plt.ylabel('True Positive Rate') > > > plt.xlabel('False Positive Rate') > > > fpr = > [0.0002337345394340,0.0001924870472260,0.0001626973851550,0.0000950977673794, > > > > 0.0000721826427097,0.0000538505429739,0.0000389557119386,0.0000263523933702, > > > 0.0000137490748018] > > > > > > tpr = > [0.19673638244100000000,0.18984141576600000000,0.18122270742400000000, > > > > 0.17055510860800000000,0.16434892541100000000,0.15789473684200000000, > > > > 0.15134451850100000000,0.14410480349300000000,0.13238336014700000000] > > > > > > roc_auc = metrics.auc(fpr, tpr) > > > > > > plt.plot([0, 1], [0, 1],'r--') > > > plt.plot(fpr, tpr, 'bo-', label = 'AUC = %0.9f' % roc_auc) > > > plt.legend(loc = 'lower right') > > > > > > plt.show() > > > > > > [image: Inline image 2] > > > > > > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8894 bytes Desc: not available URL: From jmschreiber91 at gmail.com Wed Dec 14 13:46:10 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 14 Dec 2016 10:46:10 -0800 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib In-Reply-To: References: Message-ID: To make a proper ROC curve you need to test all possible thresholds, not just a subset of them. You can do this easily in sklearn. import matplotlib.pyplot as plt from sklearn.metrics import roc_curve, roc_auc_score ... ... y_pred = clf.predict_proba(X) fpr, tpr, _ = roc_curve(y_true, y_pred) auc = roc_auc_score(y_true, y_pred) plt.plot(fpr, tpr, label=auc) On Wed, Dec 14, 2016 at 8:52 AM, Stuart Reynolds wrote: > You're looking at a tiny subset of the possible cutoff thresholds for this > classifier. > Lower thresholds will give higher tot at the expense of tpr. > Usually, AUC is computed at the integral of this graph over the whole > range of FPRs (from zero to one). > > If you have your classifier output probabilities or activations, the > maximum and minimum of these values will tell you what the largest and > smallest thresholds should be. Scikit also has a function to directly > receive the activations and true classes and compute the AUC and tpr/fpr > curve. > > On Wed, Dec 14, 2016 at 5:12 AM Dale T Smith > wrote: > >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> I think you need to look at the examples. >> >> >> >> >> >> >> >> >> ____________________________________________________________ >> ____________________________________________________________ >> __________________ >> >> >> *Dale T. 
Smith* >> >> *|* Macy's Systems and Technology >> >> *|* IFS eCom CSE Data Science >> >> >> >> >> 5985 State Bridge Road, Johns Creek, GA 30097 *|* dale.t.smith at macys.com >> >> >> >> >> >> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= >> macys.com at python.org] >> >> *On Behalf Of *Debabrata Ghosh >> >> >> *Sent:* Wednesday, December 14, 2016 3:13 AM >> >> >> *To:* Scikit-learn user and developer mailing list >> >> >> *Subject:* [scikit-learn] Scikit Learn Random Classifier - TPR and FPR >> plotted on matplotlib >> >> >> >> >> >> >> >> ? EXT MSG: >> >> >> >> >> >> >> >> >> >> >> Hi All, >> >> >> >> >> I have run scikit-learn Random Forest Classifier >> algorithm against a dataset and here is my TPR and FPR against various >> thresholds: >> >> >> >> >> >> [image: Inline image 1] >> >> >> >> >> Further I have plotted the above values in matplotlib and am getting a >> very low AUC. Here is my matplotlib code. Can I understand the >> interpretation of the graph from you please.Is my model Ok or is there >> something wrong ? Appreciate for >> >> a quick response please. >> >> >> >> >> >> import matplotlib.pyplot as plt >> >> >> import numpy as np >> >> >> from sklearn import metrics >> >> >> plt.title('Receiver Operating Characteristic') >> >> >> plt.ylabel('True Positive Rate') >> >> >> plt.xlabel('False Positive Rate') >> >> >> fpr = [0.0002337345394340,0.0001924870472260,0.0001626973851550,0. >> 0000950977673794, >> >> >> 0.0000721826427097,0.0000538505429739,0.0000389557119386,0. >> 0000263523933702, >> >> >> 0.0000137490748018] >> >> >> >> >> >> tpr = [0.19673638244100000000,0.18984141576600000000,0. >> 18122270742400000000, >> >> >> 0.17055510860800000000,0.16434892541100000000,0. >> 15789473684200000000, >> >> >> 0.15134451850100000000,0.14410480349300000000,0. >> 13238336014700000000] >> >> >> >> >> >> roc_auc = metrics.auc(fpr, tpr) >> >> >> >> >> >> plt.plot([0, 1], [0, 1],'r--') >> >> >> plt.plot(fpr, tpr, 'bo-', label = 'AUC = %0.9f' % roc_auc) >> >> >> plt.legend(loc = 'lower right') >> >> >> >> >> >> plt.show() >> >> >> >> >> >> [image: Inline image 2] >> >> >> >> >> >> >> * This is an EXTERNAL EMAIL. Stop and think before clicking a link or >> opening attachments. >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From melamed at uchicago.edu Thu Dec 15 12:21:42 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Thu, 15 Dec 2016 17:21:42 +0000 Subject: [scikit-learn] biased predictions in logistic regression Message-ID: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> Hi all, Does anyone have any suggestions for this problem: http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. 
The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. [enter image description here] -------------- next part -------------- An HTML attachment was scrubbed... URL: From aadral at gmail.com Thu Dec 15 13:54:44 2016 From: aadral at gmail.com (Alexey Dral) Date: Thu, 15 Dec 2016 21:54:44 +0300 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> Message-ID: Hi Rachel, Do you have your data normalized? 2016-12-15 20:21 GMT+03:00 Rachel Melamed : > Hi all, > Does anyone have any suggestions for this problem: > http://stackoverflow.com/questions/41125342/sklearn- > logistic-regression-gives-biased-results > > I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > > I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. The > bias observed is unacceptable for my use case, but the more-regularized > model does seem a bit better. > > Below, I plot the results for the 1000 different regressions for 2 > different values of C: [image: results for the different regressions for > 2 different values of C] > > I looked at the parameter estimates for one of these regressions: below > each point is one parameter. It seems like the intercept (the point on the > bottom left) is too high for the C=1 model. [image: enter image > description here] > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Yours sincerely, Alexey A. Dral -------------- next part -------------- An HTML attachment was scrubbed... URL: From melamed at uchicago.edu Thu Dec 15 14:03:22 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Thu, 15 Dec 2016 19:03:22 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> Message-ID: <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: Hi Rachel, Do you have your data normalized? 2016-12-15 20:21 GMT+03:00 Rachel Melamed >: Hi all, Does anyone have any suggestions for this problem: http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. 
When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. [enter image description here] _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Yours sincerely, Alexey A. Dral _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From aadral at gmail.com Thu Dec 15 14:16:15 2016 From: aadral at gmail.com (Alexey Dral) Date: Thu, 15 Dec 2016 22:16:15 +0300 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? 2016-12-15 22:03 GMT+03:00 Rachel Melamed : > Thanks for the reply. The covariates (?X") are all dummy/categorical > variables. So I guess no, nothing is normalized. > > On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: > > Hi Rachel, > > Do you have your data normalized? > > 2016-12-15 20:21 GMT+03:00 Rachel Melamed : > >> Hi all, >> Does anyone have any suggestions for this problem: >> http://stackoverflow.com/questions/41125342/sklearn-logistic >> -regression-gives-biased-results >> >> I am running around 1000 similar logistic regressions, with the same >> covariates but slightly different data and response variables. All of my >> response variables have a sparse successes (p(success) < .05 usually). >> >> I noticed that with the regularized regression, the results are >> consistently biased to predict more "successes" than is observed in the >> training data. When I relax the regularization, this bias goes away. The >> bias observed is unacceptable for my use case, but the more-regularized >> model does seem a bit better. >> >> Below, I plot the results for the 1000 different regressions for 2 >> different values of C: [image: results for the different regressions for >> 2 different values of C] >> >> I looked at the parameter estimates for one of these regressions: below >> each point is one parameter. It seems like the intercept (the point on the >> bottom left) is too high for the C=1 model. [image: enter image >> description here] >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Yours sincerely, > Alexey A. Dral > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Yours sincerely, Alexey A. 
Dral -------------- next part -------------- An HTML attachment was scrubbed... URL: From melamed at uchicago.edu Thu Dec 15 16:02:28 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Thu, 15 Dec 2016 21:02:28 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: I just tried it and it did not appear to change the results at all? I ran it as follows: 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 2) For each of the 1000 output variables: a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix , and a matrix I call ?success_fail? that has for each setting the number of success and number of fail, so it is of size 10000 x 2 b. Run regression using: skdesign = np.vstack((design,design)) sklabel = np.hstack((np.ones(success_fail.shape[0]), np.zeros(success_fail.shape[0]))) skweight = np.hstack((success_fail['success'], success_fail['fail'])) logregN = linear_model.LogisticRegression(C=1, solver= 'lbfgs',fit_intercept=False) logregN.fit(skdesign, sklabel, sample_weight=skweight) On Dec 15, 2016, at 2:16 PM, Alexey Dral > wrote: Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? 2016-12-15 22:03 GMT+03:00 Rachel Melamed >: Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: Hi Rachel, Do you have your data normalized? 2016-12-15 20:21 GMT+03:00 Rachel Melamed >: Hi all, Does anyone have any suggestions for this problem: http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. [enter image description here] _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Yours sincerely, Alexey A. Dral _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Yours sincerely, Alexey A. 
Dral _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Thu Dec 15 16:41:32 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Thu, 15 Dec 2016 13:41:32 -0800 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. is there one class that has a much smaller prevalence in the data that the other)? On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed wrote: > I just tried it and it did not appear to change the results at all? > I ran it as follows: > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > 2) For each of the 1000 output variables: > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. Then, I have an around 10000 x 350 design matrix , and a > matrix I call ?success_fail? that has for each setting the number of > success and number of fail, so it is of size 10000 x 2 > > b. Run regression using: > skdesign = np.vstack((design,design)) > sklabel = np.hstack((np.ones(success_fail.shape[0]), > np.zeros(success_fail.shape[0]))) > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > logregN = linear_model.LogisticRegression(C=1, > solver= 'lbfgs',fit_intercept=False) > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: > > Could you try to normalize dataset after feature dummy encoding and see if > it is reproducible behavior? > > 2016-12-15 22:03 GMT+03:00 Rachel Melamed : > >> Thanks for the reply. The covariates (?X") are all dummy/categorical >> variables. So I guess no, nothing is normalized. >> >> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: >> >> Hi Rachel, >> >> Do you have your data normalized? >> >> 2016-12-15 20:21 GMT+03:00 Rachel Melamed : >> >>> Hi all, >>> Does anyone have any suggestions for this problem: >>> http://stackoverflow.com/questions/41125342/sklearn-logistic >>> -regression-gives-biased-results >>> >>> I am running around 1000 similar logistic regressions, with the same >>> covariates but slightly different data and response variables. All of my >>> response variables have a sparse successes (p(success) < .05 usually). >>> >>> I noticed that with the regularized regression, the results are >>> consistently biased to predict more "successes" than is observed in the >>> training data. When I relax the regularization, this bias goes away. The >>> bias observed is unacceptable for my use case, but the more-regularized >>> model does seem a bit better. >>> >>> Below, I plot the results for the 1000 different regressions for 2 >>> different values of C: [image: results for the different regressions >>> for 2 different values of C] >>> >>> I looked at the parameter estimates for one of these regressions: below >>> each point is one parameter. It seems like the intercept (the point on the >>> bottom left) is too high for the C=1 model. 
[image: enter image >>> description here] >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Yours sincerely, >> Alexey A. Dral >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Yours sincerely, > Alexey A. Dral > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Dec 15 16:43:35 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 15 Dec 2016 16:43:35 -0500 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Subtracting the median wouldn?t result in normalizing the usual sense, since subtracting a constant just shifts the values by a constant. Instead, for logistic regression & most optimizers, I would recommend subtracting the mean to center the features at mean zero and divide by the standard deviation to get ?z? scores (e.g., this can be done by the StandardScaler()). Best, Sebastian > On Dec 15, 2016, at 4:02 PM, Rachel Melamed wrote: > > I just tried it and it did not appear to change the results at all? > I ran it as follows: > 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 > > 2) For each of the 1000 output variables: > a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix , and a matrix I call ?success_fail? that has for each setting the number of success and number of fail, so it is of size 10000 x 2 > > b. Run regression using: > > skdesign = np.vstack((design,design)) > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > np.zeros(success_fail.shape[0]))) > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > logregN = linear_model.LogisticRegression(C=1, > solver= 'lbfgs',fit_intercept=False) > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: >> >> Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed : >> Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. >> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: >>> >>> Hi Rachel, >>> >>> Do you have your data normalized? 
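A minimal sketch of the z-scoring suggested there, with a random dummy-coded matrix standing in for the real design (all names below are placeholders):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
design = rng.randint(0, 2, size=(1000, 5)).astype(float)   # stand-in dummy columns
y = rng.binomial(1, 0.05, size=1000)                       # rare "successes"

# StandardScaler centers every column at mean 0 and rescales it to unit
# variance (z-scores); a pipeline keeps the scaling and the fit together so
# the same transform is reused at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1, solver='lbfgs'))
model.fit(design, y)
print(model.predict_proba(design)[:5, 1])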
>>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed : >>> Hi all, >>> Does anyone have any suggestions for this problem: >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results >>> >>> I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). >>> >>> I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. >>> >>> Below, I plot the results for the 1000 different regressions for 2 different values of C: >>> >>> I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> -- >>> Yours sincerely, >>> Alexey A. Dral >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> -- >> Yours sincerely, >> Alexey A. Dral >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From sean.violante at gmail.com Thu Dec 15 17:02:08 2016 From: sean.violante at gmail.com (Sean Violante) Date: Thu, 15 Dec 2016 23:02:08 +0100 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: The problem is the (stupid!) liblinear solver that also penalises the intercept (in regularisation) . Use a different solver or change the intercept_scaling parameter On 15 Dec 2016 10:44 pm, "Sebastian Raschka" wrote: > Subtracting the median wouldn?t result in normalizing the usual sense, > since subtracting a constant just shifts the values by a constant. Instead, > for logistic regression & most optimizers, I would recommend subtracting > the mean to center the features at mean zero and divide by the standard > deviation to get ?z? scores (e.g., this can be done by the > StandardScaler()). > > Best, > Sebastian > > > On Dec 15, 2016, at 4:02 PM, Rachel Melamed > wrote: > > > > I just tried it and it did not appear to change the results at all? > > I ran it as follows: > > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > > > 2) For each of the 1000 output variables: > > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. 
Then, I have an around 10000 x 350 design matrix , and a > matrix I call ?success_fail? that has for each setting the number of > success and number of fail, so it is of size 10000 x 2 > > > > b. Run regression using: > > > > skdesign = np.vstack((design,design)) > > > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > > np.zeros(success_fail.shape[0]))) > > > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > > > logregN = linear_model.LogisticRegression(C=1, > > solver= 'lbfgs',fit_intercept=False) > > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > > > >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: > >> > >> Could you try to normalize dataset after feature dummy encoding and see > if it is reproducible behavior? > >> > >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed : > >> Thanks for the reply. The covariates (?X") are all dummy/categorical > variables. So I guess no, nothing is normalized. > >> > >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: > >>> > >>> Hi Rachel, > >>> > >>> Do you have your data normalized? > >>> > >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed : > >>> Hi all, > >>> Does anyone have any suggestions for this problem: > >>> http://stackoverflow.com/questions/41125342/sklearn- > logistic-regression-gives-biased-results > >>> > >>> I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > >>> > >>> I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. The > bias observed is unacceptable for my use case, but the more-regularized > model does seem a bit better. > >>> > >>> Below, I plot the results for the 1000 different regressions for 2 > different values of C: > >>> > >>> I looked at the parameter estimates for one of these regressions: > below each point is one parameter. It seems like the intercept (the point > on the bottom left) is too high for the C=1 model. > >>> > >>> > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> > >>> > >>> > >>> -- > >>> Yours sincerely, > >>> Alexey A. Dral > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > >> > >> > >> -- > >> Yours sincerely, > >> Alexey A. Dral > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
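A small sketch of the two remedies mentioned in that message: keep liblinear but raise intercept_scaling so the intercept is shrunk less, or switch to a solver that leaves the intercept out of the penalty (the random data below is only a placeholder):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = rng.binomial(1, 0.05, size=1000)

# liblinear treats the intercept as one more penalized weight; a large
# intercept_scaling reduces how strongly it gets shrunk.
clf_liblinear = LogisticRegression(C=1, solver='liblinear',
                                   intercept_scaling=100.).fit(X, y)

# Solvers such as lbfgs or newton-cg fit the intercept without penalizing it.
clf_lbfgs = LogisticRegression(C=1, solver='lbfgs').fit(X, y)

print(clf_liblinear.intercept_, clf_lbfgs.intercept_)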
URL: From sean.violante at gmail.com Thu Dec 15 17:05:51 2016 From: sean.violante at gmail.com (Sean Violante) Date: Thu, 15 Dec 2016 23:05:51 +0100 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Sorry just saw you are not using the liblinear solver, agree with Sebastian, you should subtract mean not median On 15 Dec 2016 11:02 pm, "Sean Violante" wrote: > The problem is the (stupid!) liblinear solver that also penalises the > intercept (in regularisation) . Use a different solver or change the > intercept_scaling parameter > > On 15 Dec 2016 10:44 pm, "Sebastian Raschka" wrote: > >> Subtracting the median wouldn?t result in normalizing the usual sense, >> since subtracting a constant just shifts the values by a constant. Instead, >> for logistic regression & most optimizers, I would recommend subtracting >> the mean to center the features at mean zero and divide by the standard >> deviation to get ?z? scores (e.g., this can be done by the >> StandardScaler()). >> >> Best, >> Sebastian >> >> > On Dec 15, 2016, at 4:02 PM, Rachel Melamed >> wrote: >> > >> > I just tried it and it did not appear to change the results at all? >> > I ran it as follows: >> > 1) Normalize dummy variables (by subtracting median) to make a matrix >> of about 10000 x 5 >> > >> > 2) For each of the 1000 output variables: >> > a. Each output variable uses the same dummy variables, but not all >> settings of covariates are observed for all output variables. So I create >> the design matrix using patsy per output variable to include pairwise >> interactions. Then, I have an around 10000 x 350 design matrix , and a >> matrix I call ?success_fail? that has for each setting the number of >> success and number of fail, so it is of size 10000 x 2 >> > >> > b. Run regression using: >> > >> > skdesign = np.vstack((design,design)) >> > >> > sklabel = np.hstack((np.ones(success_fail.shape[0]), >> > np.zeros(success_fail.shape[0]))) >> > >> > skweight = np.hstack((success_fail['success'], success_fail['fail'])) >> > >> > logregN = linear_model.LogisticRegression(C=1, >> > solver= 'lbfgs',fit_intercept=False) >> > logregN.fit(skdesign, sklabel, sample_weight=skweight) >> > >> > >> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: >> >> >> >> Could you try to normalize dataset after feature dummy encoding and >> see if it is reproducible behavior? >> >> >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed : >> >> Thanks for the reply. The covariates (?X") are all dummy/categorical >> variables. So I guess no, nothing is normalized. >> >> >> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: >> >>> >> >>> Hi Rachel, >> >>> >> >>> Do you have your data normalized? >> >>> >> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed : >> >>> Hi all, >> >>> Does anyone have any suggestions for this problem: >> >>> http://stackoverflow.com/questions/41125342/sklearn-logistic >> -regression-gives-biased-results >> >>> >> >>> I am running around 1000 similar logistic regressions, with the same >> covariates but slightly different data and response variables. All of my >> response variables have a sparse successes (p(success) < .05 usually). >> >>> >> >>> I noticed that with the regularized regression, the results are >> consistently biased to predict more "successes" than is observed in the >> training data. When I relax the regularization, this bias goes away. 
The >> bias observed is unacceptable for my use case, but the more-regularized >> model does seem a bit better. >> >>> >> >>> Below, I plot the results for the 1000 different regressions for 2 >> different values of C: >> >>> >> >>> I looked at the parameter estimates for one of these regressions: >> below each point is one parameter. It seems like the intercept (the point >> on the bottom left) is too high for the C=1 model. >> >>> >> >>> >> >>> >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>> >> >>> >> >>> >> >>> >> >>> -- >> >>> Yours sincerely, >> >>> Alexey A. Dral >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> -- >> >> Yours sincerely, >> >> Alexey A. Dral >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Thu Dec 15 18:00:12 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Thu, 15 Dec 2016 15:00:12 -0800 Subject: [scikit-learn] Model checksums In-Reply-To: References: Message-ID: I don't mean that scikit-learn's modeling is non-deterministic -- I mean the pickle library. Same input different serialized bytes output. It was my recollection that dictionaries were inconsistently ordered when serialized, or some the object ID was included in the serialization -- anyhow I don't seem to be able reproduce it now I've fixed a bug and am actually providing identical input to serialize. Thanks for the joblib serialization link. The memory serializer is buried in the docs (is not mentioned in the docs on persistence) On Tue, Dec 13, 2016 at 12:10 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > What do you mean non deterministic? If you set the random_state of > models, we try to make them deterministic. Most often, any residual > variability is numerical noise that reveals statistical error bars. > > G > > Sent from my phone. Please forgive brevity and mis spelling > On Dec 13, 2016, at 19:29, Stuart Reynolds > wrote: > >> I'd like to cache some functions to avoid rebuilding models like so: >> >> @cached >> def train(model, dataparams): ... >> >> >> model is an (untrained) scikit-learn object and dataparams is a dict. >> The @cached annotation forms a SHA checksum out of the parameters of the >> function it annotates and returns the previously calculated function result >> if the parameters match. >> >> The tricky part here is reliably generating a checksum from the >> parameters. 
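One possible sketch of such a checksum, hashing the estimator's get_params() together with the data parameters instead of a pickle byte stream; it assumes the parameter values have stable reprs (plain numbers and strings), and the helper name here is made up:

import hashlib

from sklearn.linear_model import LogisticRegression

def param_checksum(model, dataparams):
    """SHA-256 over the estimator class, its hyper-parameters and the data params."""
    parts = [
        type(model).__name__,
        repr(sorted(model.get_params().items())),
        repr(sorted(dataparams.items())),
    ]
    return hashlib.sha256('|'.join(parts).encode('utf-8')).hexdigest()

key = param_checksum(LogisticRegression(C=1), {'dataset': 'train', 'fold': 3})
print(key)

joblib's Memory cache, mentioned above, hashes function arguments for the same purpose, so it may already cover this use case without a hand-rolled key.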
Scikit uses Python's pickle (http://scikit-learn.org/ >> stable/modules/model_persistence.html) but the pickle library is >> non-deterministic (same inputs to pickle.dumps yields differing output! -- >> *I know*). >> >> So... any suggestions on how to generate checksums from models in python? >> >> Thanks. >> - Stuart >> >> >> ------------------------------ >> >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From melamed at uchicago.edu Thu Dec 15 22:04:03 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Fri, 16 Dec 2016 03:04:03 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Stuart, Yes the data is quite imbalanced (this is what I meant by p(success) < .05 ) To be clear, I calculate \sum_i \hat{y_i} = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1)) and compare that number to the observed number of success. I find the predicted number to always be higher (I think, because of the intercept). I was not aware of a bias for imbalanced data. Can you tell me more? Why does it not appear with the relaxed regularization? Also, using the same data with statsmodels LR, which has no regularization, this doesn't seem to be a problem. Any suggestions for how I could fix this are welcome. Thank you On Dec 15, 2016, at 4:41 PM, Stuart Reynolds > wrote: LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. is there one class that has a much smaller prevalence in the data that the other)? On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed > wrote: I just tried it and it did not appear to change the results at all? I ran it as follows: 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 2) For each of the 1000 output variables: a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix , and a matrix I call ?success_fail? that has for each setting the number of success and number of fail, so it is of size 10000 x 2 b. Run regression using: skdesign = np.vstack((design,design)) sklabel = np.hstack((np.ones(success_fail.shape[0]), np.zeros(success_fail.shape[0]))) skweight = np.hstack((success_fail['success'], success_fail['fail'])) logregN = linear_model.LogisticRegression(C=1, solver= 'lbfgs',fit_intercept=False) logregN.fit(skdesign, sklabel, sample_weight=skweight) On Dec 15, 2016, at 2:16 PM, Alexey Dral > wrote: Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? 2016-12-15 22:03 GMT+03:00 Rachel Melamed >: Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: Hi Rachel, Do you have your data normalized? 
2016-12-15 20:21 GMT+03:00 Rachel Melamed >: Hi all, Does anyone have any suggestions for this problem: http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. [enter image description here] _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Yours sincerely, Alexey A. Dral _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Yours sincerely, Alexey A. Dral _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Thu Dec 15 23:30:36 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Fri, 16 Dec 2016 04:30:36 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Here's a discussion http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression See the Zheng and King reference. It would be nice to have these methods in scikit. On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed wrote: > > > > > > > > > > > Stuart, > > > > Yes the data is quite imbalanced (this is what I meant by p(success) < .05 > ) > > > > > > > > > > > > To be clear, I calculate > > > > > \sum_i \hat{y_i} > = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1)) > > > > > and compare that number to the observed number of success. I find the > predicted number to always be higher (I think, because of the intercept). > > > > > > > > > > > > I was not aware of a bias for imbalanced data. Can you tell me more? Why > does it not appear with the relaxed regularization? Also, using the same > data with statsmodels LR, which has no regularization, this doesn't seem to > be a problem. Any suggestions for > > how I could fix this are welcome. 
> > > > > > > > > > > > Thank you > > > > > > > > > > > > > > On Dec 15, 2016, at 4:41 PM, Stuart Reynolds > wrote: > > > > > > > > LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. > is there one class that has a much smaller prevalence in the data that the > other)? > > > > > > On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed > > wrote: > > > > > I just tried it and it did not appear to change the results at all? > > I ran it as follows: > > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > > > > > > > 2) For each of the 1000 output variables: > > > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. Then, I have an around > > 10000 x 350 design matrix , and a matrix I call ?success_fail? that has > for each setting the number of success and number of fail, so it is of size > 10000 x 2 > > > > > > > > b. Run regression using: > > > > > > > skdesign = np.vstack((design,design)) > > > > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > > > np.zeros(success_fail.shape[0]))) > > > > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > > > > > > > > > logregN = linear_model.LogisticRegression(C=1, > > > solver= 'lbfgs',fit_intercept=False) > > > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > > > > > > > > > > > > > > > > > > > > > On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: > > > > > > > > Could you try to normalize dataset after feature dummy encoding and see if > it is reproducible behavior? > > > > > 2016-12-15 22:03 GMT+03:00 Rachel Melamed > > : > > > > > Thanks for the reply. The covariates (?X") are all dummy/categorical > variables. So I guess no, nothing is normalized. > > > > > > > > > > > > > > > On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: > > > > > > > > Hi Rachel, > > > > > > > Do you have your data normalized? > > > > > > 2016-12-15 20:21 GMT+03:00 Rachel Melamed > > : > > > > > > > Hi all, > > > Does anyone have any suggestions for this problem: > > > > http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results > > > > > > > > > > > I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > > > > > I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. The > bias observed is unacceptable for my use case, but > > the more-regularized model does seem a bit better. > > > > > Below, I plot the results for the 1000 different regressions for 2 > different values of C: [image: results for the different regressions for > 2 different values of C] > > > > > I looked at the parameter estimates for one of these regressions: below > each point is one parameter. It seems like the intercept (the point on the > bottom left) is too high for the C=1 model. 
[image: enter image > description here] > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. Dral > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. Dral > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Fri Dec 16 00:11:00 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 16 Dec 2016 00:11:00 -0500 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: just some generic comments, I don't have any experience with penalized estimation nor did I go through the math. In unregularized Logistis Regression or Logit and in several other models the estimator satisfies some aggregation properties so that in sample or training set proportions match between predicted proportions and those of the sample. Regularized estimation does not require unbiased estimation of the parameters because it maximizes a different objective function, like mean squared error in the linear model. We are trading off bias against variance. I think this will propagate to the prediction, but I'm not sure whether an unpenalized intercept can be made to compensate for the bias in the average prediction. For Logit this would mean that although we have a bias, we have less variance/variation in the prediction, so overall we are doing better than with unregularized prediction under the chosen penalization measure. I assume because the regularization biases towards zero coefficients it also biases towards a prediction of 0.5, unless it's compensated for by the intercept. I didn't read the King and Zheng (2001) article, but it doesn't mention penalization or regularization, based on a brief search, so it doesn't seem to address the regularization bias. 
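A sketch of that aggregation point on synthetic imbalanced data (none of it is the original poster's): with an unpenalized intercept the in-sample mean prediction tracks the observed rate, whereas when the constant term is just another penalized column, strong shrinkage pulls the mean prediction toward 0.5.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.97],
                           random_state=0)
X_const = np.hstack([X, np.ones((X.shape[0], 1))])    # explicit constant column
print('observed positive rate: %.3f' % y.mean())

for C in (1.0, 1e-3):                                 # mild vs. strong penalty
    free = LogisticRegression(C=C, solver='lbfgs', max_iter=1000).fit(X, y)
    pen = LogisticRegression(C=C, solver='lbfgs', max_iter=1000,
                             fit_intercept=False).fit(X_const, y)
    print('C=%-6g  unpenalized intercept: %.3f   penalized constant: %.3f'
          % (C, free.predict_proba(X)[:, 1].mean(),
             pen.predict_proba(X_const)[:, 1].mean()))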
(Aside, from the literature I think many people use a different model than logistic for rare events data, either Poisson with exponential link or Binomial/Bernoulli with an asymmetric link function.) I think, demeaning could help because it reduces the dependence between the intercept and the other penalized variables, but because of the nonlinear model it will not make it orthogonal. The question is whether it's possible to improve the estimator by additionally adjusting the mean or the threshold for 0-1 predictions. It might depend on the criteria to choose the penalization. I don't know and have no idea what scikit-learn does. Josef On Thu, Dec 15, 2016 at 11:30 PM, Stuart Reynolds wrote: > Here's a discussion > > http://stats.stackexchange.com/questions/6067/does-an- > unbalanced-sample-matter-when-doing-logistic-regression > > See the Zheng and King reference. > It would be nice to have these methods in scikit. > > > > On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed > wrote: > >> >> >> >> >> >> >> >> >> >> >> Stuart, >> >> >> >> Yes the data is quite imbalanced (this is what I meant by p(success) < >> .05 ) >> >> >> >> >> >> >> >> >> >> >> >> To be clear, I calculate >> >> >> >> >> \sum_i \hat{y_i} = logregN.predict_proba(design)[:,1]*(success_fail. >> sum(axis=1)) >> >> >> >> >> and compare that number to the observed number of success. I find the >> predicted number to always be higher (I think, because of the intercept). >> >> >> >> >> >> >> >> >> >> >> >> I was not aware of a bias for imbalanced data. Can you tell me more? Why >> does it not appear with the relaxed regularization? Also, using the same >> data with statsmodels LR, which has no regularization, this doesn't seem to >> be a problem. Any suggestions for >> >> how I could fix this are welcome. >> >> >> >> >> >> >> >> >> >> >> >> Thank you >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Dec 15, 2016, at 4:41 PM, Stuart Reynolds >> wrote: >> >> >> >> >> >> >> >> LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. >> is there one class that has a much smaller prevalence in the data that the >> other)? >> >> >> >> >> >> On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed >> >> wrote: >> >> >> >> >> I just tried it and it did not appear to change the results at all? >> >> I ran it as follows: >> >> 1) Normalize dummy variables (by subtracting median) to make a matrix of >> about 10000 x 5 >> >> >> >> >> >> >> >> 2) For each of the 1000 output variables: >> >> >> a. Each output variable uses the same dummy variables, but not all >> settings of covariates are observed for all output variables. So I create >> the design matrix using patsy per output variable to include pairwise >> interactions. Then, I have an around >> >> 10000 x 350 design matrix , and a matrix I call ?success_fail? that has >> for each setting the number of success and number of fail, so it is of size >> 10000 x 2 >> >> >> >> >> >> >> >> b. 
Run regression using: >> >> >> >> >> >> >> skdesign = np.vstack((design,design)) >> >> >> >> >> sklabel = np.hstack((np.ones(success_fail.shape[0]), >> >> >> np.zeros(success_fail.shape[0]))) >> >> >> >> >> skweight = np.hstack((success_fail['success'], success_fail['fail'])) >> >> >> >> >> >> >> >> >> >> logregN = linear_model.LogisticRegression(C=1, >> >> >> solver= 'lbfgs',fit_intercept=False) >> >> >> logregN.fit(skdesign, sklabel, sample_weight=skweight) >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: >> >> >> >> >> >> >> >> Could you try to normalize dataset after feature dummy encoding and see >> if it is reproducible behavior? >> >> >> >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed >> >> : >> >> >> >> >> Thanks for the reply. The covariates (?X") are all dummy/categorical >> variables. So I guess no, nothing is normalized. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: >> >> >> >> >> >> >> >> Hi Rachel, >> >> >> >> >> >> >> Do you have your data normalized? >> >> >> >> >> >> 2016-12-15 20:21 GMT+03:00 Rachel Melamed >> >> : >> >> >> >> >> >> >> Hi all, >> >> >> Does anyone have any suggestions for this problem: >> >> >> http://stackoverflow.com/questions/41125342/sklearn- >> logistic-regression-gives-biased-results >> >> >> >> >> >> >> >> >> >> >> I am running around 1000 similar logistic regressions, with the same >> covariates but slightly different data and response variables. All of my >> response variables have a sparse successes (p(success) < .05 usually). >> >> >> >> >> I noticed that with the regularized regression, the results are >> consistently biased to predict more "successes" than is observed in the >> training data. When I relax the regularization, this bias goes away. The >> bias observed is unacceptable for my use case, but >> >> the more-regularized model does seem a bit better. >> >> >> >> >> Below, I plot the results for the 1000 different regressions for 2 >> different values of C: [image: results for the different regressions for >> 2 different values of C] >> >> >> >> >> I looked at the parameter estimates for one of these regressions: below >> each point is one parameter. It seems like the intercept (the point on the >> bottom left) is too high for the C=1 model. [image: enter image >> description here] >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> >> >> >> >> >> >> >> >> >> >> Yours sincerely, >> >> >> Alexey A. Dral >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> >> >> >> >> >> >> >> >> >> >> Yours sincerely, >> >> >> Alexey A. 
Dral >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Fri Dec 16 00:30:42 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Fri, 16 Dec 2016 05:30:42 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Sorry... I mean penalized likelihood, not large weight penalization. Here's the reference I was thinking of http://m.statisticalhorizons.com/?task=get&pageid=1424858329 On Thu, Dec 15, 2016 at 9:12 PM wrote: > just some generic comments, I don't have any experience with penalized > estimation nor did I go through the math. > > In unregularized Logistis Regression or Logit and in several other models > the estimator satisfies some aggregation properties so that in sample or > training set proportions match between predicted proportions and those of > the sample. > > Regularized estimation does not require unbiased estimation of the > parameters because it maximizes a different objective function, like mean > squared error in the linear model. We are trading off bias against > variance. I think this will propagate to the prediction, but I'm not sure > whether an unpenalized intercept can be made to compensate for the bias in > the average prediction. > > For Logit this would mean that although we have a bias, we have less > variance/variation in the prediction, so overall we are doing better than > with unregularized prediction under the chosen penalization measure. > I assume because the regularization biases towards zero coefficients it > also biases towards a prediction of 0.5, unless it's compensated for by the > intercept. > > I didn't read the King and Zheng (2001) article, but it doesn't mention > penalization or regularization, based on a brief search, so it doesn't seem > to address the regularization bias. (Aside, from the literature I think > many people use a different model than logistic for rare events data, > either Poisson with exponential link or Binomial/Bernoulli with an > asymmetric link function.) > > I think, demeaning could help because it reduces the dependence between > the intercept and the other penalized variables, but because of the > nonlinear model it will not make it orthogonal. 
> > The question is whether it's possible to improve the estimator by > additionally adjusting the mean or the threshold for 0-1 predictions. It > might depend on the criteria to choose the penalization. I don't know and > have no idea what scikit-learn does. > > Josef > > On Thu, Dec 15, 2016 at 11:30 PM, Stuart Reynolds < > stuart at stuartreynolds.net> wrote: > > Here's a discussion > > > http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression > > See the Zheng and King reference. > It would be nice to have these methods in scikit. > > > > On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed > wrote: > > > > > > > > > > > > Stuart, > > > > Yes the data is quite imbalanced (this is what I meant by p(success) < .05 > ) > > > > > > > > > > > > To be clear, I calculate > > > > > \sum_i \hat{y_i} > = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1)) > > > > > and compare that number to the observed number of success. I find the > predicted number to always be higher (I think, because of the intercept). > > > > > > > > > > > > I was not aware of a bias for imbalanced data. Can you tell me more? Why > does it not appear with the relaxed regularization? Also, using the same > data with statsmodels LR, which has no regularization, this doesn't seem to > be a problem. Any suggestions for > > how I could fix this are welcome. > > > > > > > > > > > > Thank you > > > > > > > > > > > > > > On Dec 15, 2016, at 4:41 PM, Stuart Reynolds > wrote: > > > > > > > > LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. > is there one class that has a much smaller prevalence in the data that the > other)? > > > > > > On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed > > wrote: > > > > > I just tried it and it did not appear to change the results at all? > > I ran it as follows: > > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > > > > > > > 2) For each of the 1000 output variables: > > > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. Then, I have an around > > 10000 x 350 design matrix , and a matrix I call ?success_fail? that has > for each setting the number of success and number of fail, so it is of size > 10000 x 2 > > > > > > > > b. Run regression using: > > > > > > > skdesign = np.vstack((design,design)) > > > > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > > > np.zeros(success_fail.shape[0]))) > > > > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > > > > > > > > > logregN = linear_model.LogisticRegression(C=1, > > > solver= 'lbfgs',fit_intercept=False) > > > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > > > > > > > > > > > > > > > > > > > > > On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: > > > > > > > > Could you try to normalize dataset after feature dummy encoding and see if > it is reproducible behavior? > > > > > 2016-12-15 22:03 GMT+03:00 Rachel Melamed > > : > > > > > Thanks for the reply. The covariates (?X") are all dummy/categorical > variables. So I guess no, nothing is normalized. > > > > > > > > > > > > > > > On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: > > > > > > > > Hi Rachel, > > > > > > > Do you have your data normalized? 
> > > > > > 2016-12-15 20:21 GMT+03:00 Rachel Melamed > > : > > > > > > > Hi all, > > > Does anyone have any suggestions for this problem: > > > > http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results > > > > > > > > > > > I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > > > > > I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. The > bias observed is unacceptable for my use case, but > > the more-regularized model does seem a bit better. > > > > > Below, I plot the results for the 1000 different regressions for 2 > different values of C: [image: results for the different regressions for > 2 different values of C] > > > > > I looked at the parameter estimates for one of these regressions: below > each point is one parameter. It seems like the intercept (the point on the > bottom left) is too high for the C=1 model. [image: enter image > description here] > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. Dral > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. Dral > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From melamed at uchicago.edu Sat Dec 17 22:25:10 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Sun, 18 Dec 2016 03:25:10 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: Hi Sean, Sebastian, Alexey (and Josef), I?m not sure I fully understand what normalizing a dummy should consist of, so please let me know if I am interpreting your suggestion right. I believe I can?t use the StandardScaler since I am using grouped data. As I mentioned, I?m using a matrix ?success_fail? containing the number of observations of each outcome per row of the design matrix. I tried to implement your suggestions by making my design matrix per each regression as previously, using patsy, but then normalizing: darray = np.array(design) weights = np.tile(success_fail.sum(axis=1)/success_fail.sum().sum(), (darray.shape[1],1)).transpose() dmean = (darray*weights).sum(axis=0) dvar = (((darray - dmean)**2)*weights).sum(axis=0)**.5 dvar[dvar==0] = 1 design_norm = (darray - dmean) / dvar design_norm[:,0] = 1 ## intercept stays at 1 Then I use the normalized version of the design matrix as input to the regression. It seems like the bias is still there, though the results are a bit different (see plot) [cid:26AE7EFC-97C0-4CE7-91F3-EAC513486B52 at uchicago.edu] Is this what you were suggesting? I?m also not sure I understand why the error would be higher on the training data for the less-regularized setting. Thanks again Rachel On Dec 15, 2016, at 5:05 PM, Sean Violante > wrote: Sorry just saw you are not using the liblinear solver, agree with Sebastian, you should subtract mean not median On 15 Dec 2016 11:02 pm, "Sean Violante" > wrote: The problem is the (stupid!) liblinear solver that also penalises the intercept (in regularisation) . Use a different solver or change the intercept_scaling parameter On 15 Dec 2016 10:44 pm, "Sebastian Raschka" > wrote: Subtracting the median wouldn?t result in normalizing the usual sense, since subtracting a constant just shifts the values by a constant. Instead, for logistic regression & most optimizers, I would recommend subtracting the mean to center the features at mean zero and divide by the standard deviation to get ?z? scores (e.g., this can be done by the StandardScaler()). Best, Sebastian > On Dec 15, 2016, at 4:02 PM, Rachel Melamed > wrote: > > I just tried it and it did not appear to change the results at all? > I ran it as follows: > 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 > > 2) For each of the 1000 output variables: > a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix , and a matrix I call ?success_fail? that has for each setting the number of success and number of fail, so it is of size 10000 x 2 > > b. 
Run regression using: > > skdesign = np.vstack((design,design)) > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > np.zeros(success_fail.shape[0]))) > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > logregN = linear_model.LogisticRegression(C=1, > solver= 'lbfgs',fit_intercept=False) > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > >> On Dec 15, 2016, at 2:16 PM, Alexey Dral > wrote: >> >> Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed >: >> Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. >> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: >>> >>> Hi Rachel, >>> >>> Do you have your data normalized? >>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed >: >>> Hi all, >>> Does anyone have any suggestions for this problem: >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results >>> >>> I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). >>> >>> I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. >>> >>> Below, I plot the results for the 1000 different regressions for 2 different values of C: >>> >>> I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> -- >>> Yours sincerely, >>> Alexey A. Dral >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> -- >> Yours sincerely, >> Alexey A. Dral >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PastedGraphic-2.tiff Type: image/tiff Size: 25990 bytes Desc: PastedGraphic-2.tiff URL: From josef.pktd at gmail.com Sun Dec 18 10:09:42 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 18 Dec 2016 10:09:42 -0500 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: On Sat, Dec 17, 2016 at 10:25 PM, Rachel Melamed wrote: > Hi Sean, Sebastian, Alexey (and Josef), > I?m not sure I fully understand what normalizing a dummy should consist > of, so please let me know if I am interpreting your suggestion right. I > believe I can?t use the StandardScaler since I am using grouped data. As I > mentioned, I?m using a matrix ?success_fail? containing the number of > observations of each outcome per row of the design matrix. I tried to > implement your suggestions by making my design matrix per each regression > as previously, using patsy, but then normalizing: > > darray = np.array(design) > weights = np.tile(success_fail.sum(axis=1)/success_fail.sum().sum(), > (darray.shape[1],1)).transpose() > > dmean = (darray*weights).sum(axis=0) > dvar = (((darray - dmean)**2)*weights).sum(axis=0)**.5 > dvar[dvar==0] = 1 > design_norm = (darray - dmean) / dvar > design_norm[:,0] = 1 ## intercept stays at 1 > > Then I use the normalized version of the design matrix as input to the > regression. It seems like the bias is still there, though the results are a > bit different (see plot) > Is this what you were suggesting? I?m also not sure I understand why the > error would be higher on the training data for the less-regularized setting. > Thanks again > Rachel > Doing a partial check on the math: AFAICS: The estimating equation or gradient condition for the intercept is unchanged even when the other parameters are penalized. This means that in the training sample the fraction of success or failures should coincide with the predicted probabilities of success or failures. The intercept should compensate for the penalization of the other parameters and for any transformation or standardization of the other explanatory variables. AFAICS from your code snippets. You are using the intercept as part of X and fit_intercept=False. This means your intercept is penalized. If you use the built-in fit_intercept=True, then the intercept is not penalized and the bias should go away. Josef > > On Dec 15, 2016, at 5:05 PM, Sean Violante > wrote: > > Sorry just saw you are not using the liblinear solver, agree with > Sebastian, you should subtract mean not median > > On 15 Dec 2016 11:02 pm, "Sean Violante" wrote: > >> The problem is the (stupid!) liblinear solver that also penalises the >> intercept (in regularisation) . Use a different solver or change the >> intercept_scaling parameter >> >> On 15 Dec 2016 10:44 pm, "Sebastian Raschka" >> wrote: >> >>> Subtracting the median wouldn?t result in normalizing the usual sense, >>> since subtracting a constant just shifts the values by a constant. Instead, >>> for logistic regression & most optimizers, I would recommend subtracting >>> the mean to center the features at mean zero and divide by the standard >>> deviation to get ?z? scores (e.g., this can be done by the >>> StandardScaler()). >>> >>> Best, >>> Sebastian >>> >>> > On Dec 15, 2016, at 4:02 PM, Rachel Melamed >>> wrote: >>> > >>> > I just tried it and it did not appear to change the results at all? 
>>> > I ran it as follows: >>> > 1) Normalize dummy variables (by subtracting median) to make a matrix >>> of about 10000 x 5 >>> > >>> > 2) For each of the 1000 output variables: >>> > a. Each output variable uses the same dummy variables, but not all >>> settings of covariates are observed for all output variables. So I create >>> the design matrix using patsy per output variable to include pairwise >>> interactions. Then, I have an around 10000 x 350 design matrix , and a >>> matrix I call ?success_fail? that has for each setting the number of >>> success and number of fail, so it is of size 10000 x 2 >>> > >>> > b. Run regression using: >>> > >>> > skdesign = np.vstack((design,design)) >>> > >>> > sklabel = np.hstack((np.ones(success_fail.shape[0]), >>> > np.zeros(success_fail.shape[0]))) >>> > >>> > skweight = np.hstack((success_fail['success'], success_fail['fail'])) >>> > >>> > logregN = linear_model.LogisticRegression(C=1, >>> > solver= >>> 'lbfgs',fit_intercept=False) >>> > logregN.fit(skdesign, sklabel, sample_weight=skweight) >>> > >>> > >>> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: >>> >> >>> >> Could you try to normalize dataset after feature dummy encoding and >>> see if it is reproducible behavior? >>> >> >>> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed : >>> >> Thanks for the reply. The covariates (?X") are all dummy/categorical >>> variables. So I guess no, nothing is normalized. >>> >> >>> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: >>> >>> >>> >>> Hi Rachel, >>> >>> >>> >>> Do you have your data normalized? >>> >>> >>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed : >>> >>> Hi all, >>> >>> Does anyone have any suggestions for this problem: >>> >>> http://stackoverflow.com/questions/41125342/sklearn-logistic >>> -regression-gives-biased-results >>> >>> >>> >>> I am running around 1000 similar logistic regressions, with the same >>> covariates but slightly different data and response variables. All of my >>> response variables have a sparse successes (p(success) < .05 usually). >>> >>> >>> >>> I noticed that with the regularized regression, the results are >>> consistently biased to predict more "successes" than is observed in the >>> training data. When I relax the regularization, this bias goes away. The >>> bias observed is unacceptable for my use case, but the more-regularized >>> model does seem a bit better. >>> >>> >>> >>> Below, I plot the results for the 1000 different regressions for 2 >>> different values of C: >>> >>> >>> >>> I looked at the parameter estimates for one of these regressions: >>> below each point is one parameter. It seems like the intercept (the point >>> on the bottom left) is too high for the C=1 model. >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> >>> scikit-learn mailing list >>> >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> Yours sincerely, >>> >>> Alexey A. Dral >>> >>> _______________________________________________ >>> >>> scikit-learn mailing list >>> >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >>> >> >>> >> _______________________________________________ >>> >> scikit-learn mailing list >>> >> scikit-learn at python.org >>> >> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >>> >> >>> >> >>> >> >>> >> -- >>> >> Yours sincerely, >>> >> Alexey A. 
Dral >>> >> _______________________________________________ >>> >> scikit-learn mailing list >>> >> scikit-learn at python.org >>> >> https://mail.python.org/mailman/listinfo/scikit-learn >>> > >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From melamed at uchicago.edu Sun Dec 18 17:04:08 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Sun, 18 Dec 2016 22:04:08 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Message-ID: <80D4DF68-3649-4B7F-AD07-0D8F8E926960@uchicago.edu> Josef! Thank you! You are a gift to the python statistics world. I don?t know why I did not think of that. That is the answer. Rachel On Dec 18, 2016, at 10:09 AM, josef.pktd at gmail.com wrote: On Sat, Dec 17, 2016 at 10:25 PM, Rachel Melamed > wrote: Hi Sean, Sebastian, Alexey (and Josef), I?m not sure I fully understand what normalizing a dummy should consist of, so please let me know if I am interpreting your suggestion right. I believe I can?t use the StandardScaler since I am using grouped data. As I mentioned, I?m using a matrix ?success_fail? containing the number of observations of each outcome per row of the design matrix. I tried to implement your suggestions by making my design matrix per each regression as previously, using patsy, but then normalizing: darray = np.array(design) weights = np.tile(success_fail.sum(axis=1)/success_fail.sum().sum(), (darray.shape[1],1)).transpose() dmean = (darray*weights).sum(axis=0) dvar = (((darray - dmean)**2)*weights).sum(axis=0)**.5 dvar[dvar==0] = 1 design_norm = (darray - dmean) / dvar design_norm[:,0] = 1 ## intercept stays at 1 Then I use the normalized version of the design matrix as input to the regression. It seems like the bias is still there, though the results are a bit different (see plot) Is this what you were suggesting? I?m also not sure I understand why the error would be higher on the training data for the less-regularized setting. Thanks again Rachel Doing a partial check on the math: AFAICS: The estimating equation or gradient condition for the intercept is unchanged even when the other parameters are penalized. This means that in the training sample the fraction of success or failures should coincide with the predicted probabilities of success or failures. The intercept should compensate for the penalization of the other parameters and for any transformation or standardization of the other explanatory variables. AFAICS from your code snippets. You are using the intercept as part of X and fit_intercept=False. This means your intercept is penalized. If you use the built-in fit_intercept=True, then the intercept is not penalized and the bias should go away. 
Josef On Dec 15, 2016, at 5:05 PM, Sean Violante > wrote: Sorry just saw you are not using the liblinear solver, agree with Sebastian, you should subtract mean not median On 15 Dec 2016 11:02 pm, "Sean Violante" > wrote: The problem is the (stupid!) liblinear solver that also penalises the intercept (in regularisation) . Use a different solver or change the intercept_scaling parameter On 15 Dec 2016 10:44 pm, "Sebastian Raschka" > wrote: Subtracting the median wouldn?t result in normalizing the usual sense, since subtracting a constant just shifts the values by a constant. Instead, for logistic regression & most optimizers, I would recommend subtracting the mean to center the features at mean zero and divide by the standard deviation to get ?z? scores (e.g., this can be done by the StandardScaler()). Best, Sebastian > On Dec 15, 2016, at 4:02 PM, Rachel Melamed > wrote: > > I just tried it and it did not appear to change the results at all? > I ran it as follows: > 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 > > 2) For each of the 1000 output variables: > a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix , and a matrix I call ?success_fail? that has for each setting the number of success and number of fail, so it is of size 10000 x 2 > > b. Run regression using: > > skdesign = np.vstack((design,design)) > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > np.zeros(success_fail.shape[0]))) > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > logregN = linear_model.LogisticRegression(C=1, > solver= 'lbfgs',fit_intercept=False) > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > >> On Dec 15, 2016, at 2:16 PM, Alexey Dral > wrote: >> >> Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed >: >> Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. >> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: >>> >>> Hi Rachel, >>> >>> Do you have your data normalized? >>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed >: >>> Hi all, >>> Does anyone have any suggestions for this problem: >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results >>> >>> I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). >>> >>> I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. >>> >>> Below, I plot the results for the 1000 different regressions for 2 different values of C: >>> >>> I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. 
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From se.raschka at gmail.com Mon Dec 19 00:13:58 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 19 Dec 2016 00:13:58 -0500
Subject: [scikit-learn] n_jobs for LogisticRegression
Message-ID: 

Hi,

I just got confused what exactly n_jobs does for LogisticRegression. I
always thought that it was used for one-vs-rest learning, fitting the models
for binary classification in parallel. However, it also seems to do
something in the multinomial case (at least according to the verbose
option). In the docstring it says

> n_jobs : int, optional
>     Number of CPU cores used during the cross-validation loop. If given
>     a value of -1, all cores are used.

and I saw a logistic_regression_path being defined in the code. I am
wondering, is this just a workaround for the LogisticRegressionCV, and
should the n_jobs docstring in LogisticRegression be described as "Number of
CPU cores used for model fitting" instead of "during cross-validation", or
am I getting this wrong?

Best,
Sebastian

From tom.duprelatour at orange.fr Mon Dec 19 10:14:37 2016
From: tom.duprelatour at orange.fr (Tom DLT)
Date: Mon, 19 Dec 2016 16:14:37 +0100
Subject: [scikit-learn] n_jobs for LogisticRegression
In-Reply-To: 
References: 
Message-ID: 

Hi,

In LogisticRegression, n_jobs is only used for one-vs-rest parallelization.
In LogisticRegressionCV, n_jobs is used for both one-vs-rest and
cross-validation parallelizations.

So in LogisticRegression with multi_class='multinomial', n_jobs should have
no impact.

The docstring should probably be updated as you mentioned. PR welcome :)

Best,
Tom

2016-12-19 6:13 GMT+01:00 Sebastian Raschka :
> Hi,
>
> I just got confused what exactly n_jobs does for LogisticRegression.
> Always thought that it was used for one-vs-rest learning, fitting the
> models for binary classification in parallel.
However, it also seem to do > sth in the multinomial case (at least according to the verbose option). in > the docstring it says > > > n_jobs : int, optional > > Number of CPU cores used during the cross-validation loop. If > given > > a value of -1, all cores are used. > > and I saw a logistic_regression_path being defined in the code. I am > wondering, is this just a workaround for the LogisticRegressionCV, and > should the n_jobs docstring in LogisticRegression > be described as "Number of CPU cores used for model fitting? instead of > ?during cross-validation,? or am I getting this wrong? > > Best, > Sebastian > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Mon Dec 19 11:06:36 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Mon, 19 Dec 2016 17:06:36 +0100 Subject: [scikit-learn] combining arrays of features to train an MLP Message-ID: ?? Greetings, My dataset consists of objects which are characterised by their structural features which are encoded into a so called "fingerprint" form. There are several different types of fingerprints, each one encapsulating different type of information. I want to combine two specific types of fingerprints to train a MLP regressor. The first fingerprint consists of a 2048 bit array of the form: > ?FP? > 1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) The second is a 60 float number array of the form: FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, > 1.31473857, > -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > ... > 0. , 0. , 5.89652792, 0. , 0. ]) At first I tried to fuse them into a single 1D array of 2048+60 columns but the predictions of the MLP were worse than the 2 different MLP models trained from one of the 2 fingerprint types individually. My question: is there a more effective way to combine the 2 fingerprints in order to indicate that they represent different type of information? To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd row 60 elements) but sklearn complained: ? mlp.fit(x_train,y_train) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 618, in fit > return self._fit(X, y, incremental=False) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 330, in _fit > X, y = self._validate_input(X, y, incremental) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 1264, in _validate_input > multi_output=True, y_numeric=True) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 521, in check_X_y > ensure_min_features, warn_on_dtype, estimator) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 402, in check_array > array = array.astype(np.float64) > ValueError: setting an array element with a sequence. > ? ?Then I tried to ?create for each object of the dataset a 2D array of size 2x2048, by adding 1998 zeros in the second row in order both rows to be of equal size. 
However sklearn complained again: mlp.fit(x_train,y_train) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 618, in fit > return self._fit(X, y, incremental=False) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 330, in _fit > X, y = self._validate_input(X, y, incremental) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 1264, in _validate_input > multi_output=True, y_numeric=True) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 521, in check_X_y > ensure_min_features, warn_on_dtype, estimator) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 405, in check_array > % (array.ndim, estimator_name)) > ValueError: Found array with dim 3. Estimator expected <= 2. In another case of fingerprints, lets name them FP3 and FP4, I observed that the MLP regressor created using FP3 yields better results when trained and evaluated using logarithmically transformed experimental values (the values in y_train and y_test 1D arrays), while the MLP regressor created using FP4 yielded better results using the original experimental values. So my second question is: when combining both FP3 and FP4 into a single array is there any way to designate to the MLP that the features that correspond to FP3 must reproduce the logarithmic transform of the experimental values while the features of FP4 the original untransformed experimental values? I would greatly appreciate any advice on any of my 2 queries. Thomas -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Dec 19 12:17:13 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 19 Dec 2016 12:17:13 -0500 Subject: [scikit-learn] combining arrays of features to train an MLP In-Reply-To: References: Message-ID: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com> Thanks, Thomas, that makes sense! Will submit a PR then to update the docstring. Best, Sebastian > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis wrote: > > ?? > Greetings, > > My dataset consists of objects which are characterised by their structural features which are encoded into a so called "fingerprint" form. There are several different types of fingerprints, each one encapsulating different type of information. I want to combine two specific types of fingerprints to train a MLP regressor. The first fingerprint consists of a 2048 bit array of the form: > > ?FP?1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > The second is a 60 float number array of the form: > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, 1.31473857, > -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > ... > 0. , 0. , 5.89652792, 0. , 0. ]) > > At first I tried to fuse them into a single 1D array of 2048+60 columns but the predictions of the MLP were worse than the 2 different MLP models trained from one of the 2 fingerprint types individually. 
My question: is there a more effective way to combine the 2 fingerprints in order to indicate that they represent different type of information? > > To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd row 60 elements) but sklearn complained: > > ? mlp.fit(x_train,y_train) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit > return self._fit(X, y, incremental=False) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit > X, y = self._validate_input(X, y, incremental) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input > multi_output=True, y_numeric=True) > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y > ensure_min_features, warn_on_dtype, estimator) > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array > array = array.astype(np.float64) > ValueError: setting an array element with a sequence. > ? > > ?Then I tried to ?create for each object of the dataset a 2D array of size 2x2048, by adding 1998 zeros in the second row in order both rows to be of equal size. However sklearn complained again: > > > mlp.fit(x_train,y_train) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit > return self._fit(X, y, incremental=False) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit > X, y = self._validate_input(X, y, incremental) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input > multi_output=True, y_numeric=True) > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y > ensure_min_features, warn_on_dtype, estimator) > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array > % (array.ndim, estimator_name)) > ValueError: Found array with dim 3. Estimator expected <= 2. > > > In another case of fingerprints, lets name them FP3 and FP4, I observed that the MLP regressor created using FP3 yields better results when trained and evaluated using logarithmically transformed experimental values (the values in y_train and y_test 1D arrays), while the MLP regressor created using FP4 yielded better results using the original experimental values. So my second question is: when combining both FP3 and FP4 into a single array is there any way to designate to the MLP that the features that correspond to FP3 must reproduce the logarithmic transform of the experimental values while the features of FP4 the original untransformed experimental values? > > > I would greatly appreciate any advice on any of my 2 queries. 
> Thomas > > > > > > > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Mon Dec 19 16:56:56 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Mon, 19 Dec 2016 22:56:56 +0100 Subject: [scikit-learn] combining arrays of features to train an MLP In-Reply-To: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com> References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com> Message-ID: this means that both are feasible? On 19 December 2016 at 18:17, Sebastian Raschka wrote: > Thanks, Thomas, that makes sense! Will submit a PR then to update the > docstring. > > Best, > Sebastian > > > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis > wrote: > > > > ?? > > Greetings, > > > > My dataset consists of objects which are characterised by their > structural features which are encoded into a so called "fingerprint" form. > There are several different types of fingerprints, each one encapsulating > different type of information. I want to combine two specific types of > fingerprints to train a MLP regressor. The first fingerprint consists of a > 2048 bit array of the form: > > > > ?FP?1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > > > The second is a 60 float number array of the form: > > > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, > 1.31473857, > > -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > > ... > > 0. , 0. , 5.89652792, 0. , 0. ]) > > > > At first I tried to fuse them into a single 1D array of 2048+60 columns > but the predictions of the MLP were worse than the 2 different MLP models > trained from one of the 2 fingerprint types individually. My question: is > there a more effective way to combine the 2 fingerprints in order to > indicate that they represent different type of information? > > > > To this end, I tried to create a 2-row array (1st row 2048 elements and > 2nd row 60 elements) but sklearn complained: > > > > ? mlp.fit(x_train,y_train) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > return self._fit(X, y, incremental=False) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > X, y = self._validate_input(X, y, incremental) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > multi_output=True, y_numeric=True) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > ensure_min_features, warn_on_dtype, estimator) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 402, in check_array > > array = array.astype(np.float64) > > ValueError: setting an array element with a sequence. > > ? > > > > ?Then I tried to ?create for each object of the dataset a 2D array of > size 2x2048, by adding 1998 zeros in the second row in order both rows to > be of equal size. 
However sklearn complained again: > > > > > > mlp.fit(x_train,y_train) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > return self._fit(X, y, incremental=False) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > X, y = self._validate_input(X, y, incremental) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > multi_output=True, y_numeric=True) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > ensure_min_features, warn_on_dtype, estimator) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 405, in check_array > > % (array.ndim, estimator_name)) > > ValueError: Found array with dim 3. Estimator expected <= 2. > > > > > > In another case of fingerprints, lets name them FP3 and FP4, I observed > that the MLP regressor created using FP3 yields better results when trained > and evaluated using logarithmically transformed experimental values (the > values in y_train and y_test 1D arrays), while the MLP regressor created > using FP4 yielded better results using the original experimental values. So > my second question is: when combining both FP3 and FP4 into a single array > is there any way to designate to the MLP that the features that correspond > to FP3 must reproduce the logarithmic transform of the experimental values > while the features of FP4 the original untransformed experimental values? > > > > > > I would greatly appreciate any advice on any of my 2 queries. > > Thomas > > > > > > > > > > > > > > > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Dec 19 17:42:51 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 19 Dec 2016 17:42:51 -0500 Subject: [scikit-learn] combining arrays of features to train an MLP In-Reply-To: References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com> Message-ID: Oh, sorry, I just noticed that I was in the wrong thread ? meant answer a different Thomas :P. Regarding the fingerprints; scikit-learn?s estimators expect feature vectors as samples, so you can?t have a 3D array ? e.g., think of image classification: here you also enroll the n_pixels times m_pixels array into 1D arrays. 
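In code, combining the two blocks just means column-wise concatenation into
one 2D array with one row per sample. A minimal sketch with random made-up
data and shapes (standardizing only the 60 real-valued columns is a
suggestion on my part, not something from your setup):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_samples = 100
fp1 = rng.randint(0, 2, size=(n_samples, 2048)).astype(np.float32)  # bit fingerprint
fp2 = rng.randn(n_samples, 60)                                      # real-valued fingerprint
y = rng.randn(n_samples)

# Standardize only the real-valued block, then stack both blocks column-wise
# so every sample ends up as a single 1D feature vector of length 2048 + 60.
X = np.hstack([fp1, StandardScaler().fit_transform(fp2)])

mlp = MLPRegressor(hidden_layer_sizes=(100,), random_state=0, max_iter=500)
mlp.fit(X, y)

The per-block scaling just keeps the dense scores on a footing comparable to
the 0/1 bits before they enter the same hidden layer.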
The low performance can have mutliple issues. In case dimensionality is an issue, I?d maybe try stronger regularization first, or feature selection. If you are working with molecular structures, and you have enough of them, maybe also consider alternative feature representations, e.g,. learning from the graphs directly: http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf http://pubs.acs.org/doi/abs/10.1021/ci400187y Best, Sebastian > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis wrote: > > this means that both are feasible? > > On 19 December 2016 at 18:17, Sebastian Raschka wrote: > Thanks, Thomas, that makes sense! Will submit a PR then to update the docstring. > > Best, > Sebastian > > > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis wrote: > > > > ?? > > Greetings, > > > > My dataset consists of objects which are characterised by their structural features which are encoded into a so called "fingerprint" form. There are several different types of fingerprints, each one encapsulating different type of information. I want to combine two specific types of fingerprints to train a MLP regressor. The first fingerprint consists of a 2048 bit array of the form: > > > > ?FP?1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > > > The second is a 60 float number array of the form: > > > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, 1.31473857, > > -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > > ... > > 0. , 0. , 5.89652792, 0. , 0. ]) > > > > At first I tried to fuse them into a single 1D array of 2048+60 columns but the predictions of the MLP were worse than the 2 different MLP models trained from one of the 2 fingerprint types individually. My question: is there a more effective way to combine the 2 fingerprints in order to indicate that they represent different type of information? > > > > To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd row 60 elements) but sklearn complained: > > > > ? mlp.fit(x_train,y_train) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit > > return self._fit(X, y, incremental=False) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit > > X, y = self._validate_input(X, y, incremental) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input > > multi_output=True, y_numeric=True) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y > > ensure_min_features, warn_on_dtype, estimator) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array > > array = array.astype(np.float64) > > ValueError: setting an array element with a sequence. > > ? > > > > ?Then I tried to ?create for each object of the dataset a 2D array of size 2x2048, by adding 1998 zeros in the second row in order both rows to be of equal size. 
However sklearn complained again: > > > > > > mlp.fit(x_train,y_train) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit > > return self._fit(X, y, incremental=False) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit > > X, y = self._validate_input(X, y, incremental) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input > > multi_output=True, y_numeric=True) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y > > ensure_min_features, warn_on_dtype, estimator) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array > > % (array.ndim, estimator_name)) > > ValueError: Found array with dim 3. Estimator expected <= 2. > > > > > > In another case of fingerprints, lets name them FP3 and FP4, I observed that the MLP regressor created using FP3 yields better results when trained and evaluated using logarithmically transformed experimental values (the values in y_train and y_test 1D arrays), while the MLP regressor created using FP4 yielded better results using the original experimental values. So my second question is: when combining both FP3 and FP4 into a single array is there any way to designate to the MLP that the features that correspond to FP3 must reproduce the logarithmic transform of the experimental values while the features of FP4 the original untransformed experimental values? > > > > > > I would greatly appreciate any advice on any of my 2 queries. > > Thomas > > > > > > > > > > > > > > > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Mon Dec 19 17:44:00 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 19 Dec 2016 17:44:00 -0500 Subject: [scikit-learn] n_jobs for LogisticRegression In-Reply-To: References: Message-ID: <52BB82C8-1698-4131-9BD2-E54A655A7C74@gmail.com> Thanks, Tom, that makes sense. Submitted a PR to fix that. Best, Sebastian > On Dec 19, 2016, at 10:14 AM, Tom DLT wrote: > > Hi, > > In LogisticRegression, n_jobs is only used for one-vs-rest parallelization. 
> In LogisticRegressionCV, n_jobs is used for both one-vs-rest and cross-validation parallelizations. > > So in LogisticRegression with multi_class='multinomial', n_jobs should have no impact. > > The docstring should probably be updated as you mentioned. PR welcome :) > > Best, > Tom > > 2016-12-19 6:13 GMT+01:00 Sebastian Raschka : > Hi, > > I just got confused what exactly n_jobs does for LogisticRegression. Always thought that it was used for one-vs-rest learning, fitting the models for binary classification in parallel. However, it also seem to do sth in the multinomial case (at least according to the verbose option). in the docstring it says > > > n_jobs : int, optional > > Number of CPU cores used during the cross-validation loop. If given > > a value of -1, all cores are used. > > and I saw a logistic_regression_path being defined in the code. I am wondering, is this just a workaround for the LogisticRegressionCV, and should the n_jobs docstring in LogisticRegression > be described as "Number of CPU cores used for model fitting? instead of ?during cross-validation,? or am I getting this wrong? > > Best, > Sebastian > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Mon Dec 19 18:51:02 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 20 Dec 2016 00:51:02 +0100 Subject: [scikit-learn] combining arrays of features to train an MLP In-Reply-To: References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com> Message-ID: Thank you, these articles discuss about ML application of the types of fingerprints I working with! I will read them thoroughly to get some hints. In the meantime I tried to eliminate some features using RandomizedLasso and the performance escalated from R=0.067 using all 615 features to R=0.524 using only the 15 top ranked features. Naive question: does it make sense to use the RandomizedLasso to select the good features in order to train a MLP? I had the impression that RandomizedLasso uses multi-variate linear regression to fit the observed values to the experimental and rank the features. Another question: this dataset consists of 31 observations. The Pearson's R values that I reported above were calculated using cross-validation. Could someone claim that they are inaccurate because the number of features used for training the MLP is much larger than the number of observations? On 19 December 2016 at 23:42, Sebastian Raschka wrote: > Oh, sorry, I just noticed that I was in the wrong thread ? meant answer a > different Thomas :P. > > Regarding the fingerprints; scikit-learn?s estimators expect feature > vectors as samples, so you can?t have a 3D array ? e.g., think of image > classification: here you also enroll the n_pixels times m_pixels array into > 1D arrays. > > The low performance can have mutliple issues. In case dimensionality is an > issue, I?d maybe try stronger regularization first, or feature selection. > If you are working with molecular structures, and you have enough of them, > maybe also consider alternative feature representations, e.g,. 
learning > from the graphs directly: > > http://papers.nips.cc/paper/5954-convolutional-networks- > on-graphs-for-learning-molecular-fingerprints.pdf > http://pubs.acs.org/doi/abs/10.1021/ci400187y > > Best, > Sebastian > > > > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis > wrote: > > > > this means that both are feasible? > > > > On 19 December 2016 at 18:17, Sebastian Raschka > wrote: > > Thanks, Thomas, that makes sense! Will submit a PR then to update the > docstring. > > > > Best, > > Sebastian > > > > > > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis > wrote: > > > > > > ?? > > > Greetings, > > > > > > My dataset consists of objects which are characterised by their > structural features which are encoded into a so called "fingerprint" form. > There are several different types of fingerprints, each one encapsulating > different type of information. I want to combine two specific types of > fingerprints to train a MLP regressor. The first fingerprint consists of a > 2048 bit array of the form: > > > > > > ?FP?1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > > > > > The second is a 60 float number array of the form: > > > > > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, > 1.31473857, > > > -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > > > ... > > > 0. , 0. , 5.89652792, 0. , 0. > ]) > > > > > > At first I tried to fuse them into a single 1D array of 2048+60 > columns but the predictions of the MLP were worse than the 2 different MLP > models trained from one of the 2 fingerprint types individually. My > question: is there a more effective way to combine the 2 fingerprints in > order to indicate that they represent different type of information? > > > > > > To this end, I tried to create a 2-row array (1st row 2048 elements > and 2nd row 60 elements) but sklearn complained: > > > > > > ? mlp.fit(x_train,y_train) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > > return self._fit(X, y, incremental=False) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > > X, y = self._validate_input(X, y, incremental) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > > multi_output=True, y_numeric=True) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > > ensure_min_features, warn_on_dtype, estimator) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 402, in check_array > > > array = array.astype(np.float64) > > > ValueError: setting an array element with a sequence. > > > ? > > > > > > ?Then I tried to ?create for each object of the dataset a 2D array of > size 2x2048, by adding 1998 zeros in the second row in order both rows to > be of equal size. 
However sklearn complained again: > > > > > > > > > mlp.fit(x_train,y_train) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > > return self._fit(X, y, incremental=False) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > > X, y = self._validate_input(X, y, incremental) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > > multi_output=True, y_numeric=True) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > > ensure_min_features, warn_on_dtype, estimator) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 405, in check_array > > > % (array.ndim, estimator_name)) > > > ValueError: Found array with dim 3. Estimator expected <= 2. > > > > > > > > > In another case of fingerprints, lets name them FP3 and FP4, I > observed that the MLP regressor created using FP3 yields better results > when trained and evaluated using logarithmically transformed experimental > values (the values in y_train and y_test 1D arrays), while the MLP > regressor created using FP4 yielded better results using the original > experimental values. So my second question is: when combining both FP3 and > FP4 into a single array is there any way to designate to the MLP that the > features that correspond to FP3 must reproduce the logarithmic transform of > the experimental values while the features of FP4 the original > untransformed experimental values? > > > > > > > > > I would greatly appreciate any advice on any of my 2 queries. > > > Thomas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > ====================================================================== > > > Thomas Evangelidis > > > Research Specialist > > > CEITEC - Central European Institute of Technology > > > Masaryk University > > > Kamenice 5/A35/1S081, > > > 62500 Brno, Czech Republic > > > > > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University 
Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From strategist922 at gmail.com Tue Dec 20 01:38:57 2016 From: strategist922 at gmail.com (James Chang) Date: Tue, 20 Dec 2016 14:38:57 +0800 Subject: [scikit-learn] Failed when make PDF document by "make latexpdf" Message-ID: Hi, Does anyone have issue when execute "make latexpdf" to get PDF format Doc ? or I can directly download the latest PDF format doc for the currently stable version scikit-learn v 0.18.1 in some where? PS. I run the commend under Mac OS X 10.12.1 Thanks in advance and best regards, James -------------- next part -------------- An HTML attachment was scrubbed... URL: From loic.esteve at ymail.com Tue Dec 20 02:29:32 2016 From: loic.esteve at ymail.com (=?UTF-8?B?TG/Dr2MgRXN0w6h2ZQ==?=) Date: Tue, 20 Dec 2016 08:29:32 +0100 Subject: [scikit-learn] Failed when make PDF document by "make latexpdf" In-Reply-To: References: Message-ID: <096ed674-8c0b-36d2-df30-0efe23095702@ymail.com> Hi, you can get the PDF documentation from the website, see attached screenshot. Cheers, Lo?c On 12/20/2016 07:38 AM, James Chang wrote: > Hi, > > Does anyone have issue when execute "make latexpdf" to get PDF format > Doc ? > > or I can directly download the latest PDF format doc for the currently > stable version > scikit-learn v 0.18.1 in some where? > > PS. > I run the commend under Mac OS X 10.12.1 > > Thanks in advance and best regards, > James > > > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: Screenshot_20161220_082154.png Type: image/png Size: 285517 bytes Desc: not available URL: From strategist922 at gmail.com Tue Dec 20 02:46:55 2016 From: strategist922 at gmail.com (James Chang) Date: Tue, 20 Dec 2016 15:46:55 +0800 Subject: [scikit-learn] Failed when make PDF document by "make latexpdf" In-Reply-To: <096ed674-8c0b-36d2-df30-0efe23095702@ymail.com> References: <096ed674-8c0b-36d2-df30-0efe23095702@ymail.com> Message-ID: Hi Lo?c, thank you, finally I got the PDF File. Thanks and best regards, James 2016-12-20 15:29 GMT+08:00 Lo?c Est?ve via scikit-learn < scikit-learn at python.org>: > Hi, > > you can get the PDF documentation from the website, see attached > screenshot. > > Cheers, > Lo?c > > > On 12/20/2016 07:38 AM, James Chang wrote: > >> Hi, >> >> Does anyone have issue when execute "make latexpdf" to get PDF format >> Doc ? >> >> or I can directly download the latest PDF format doc for the currently >> stable version >> scikit-learn v 0.18.1 in some where? >> >> PS. >> I run the commend under Mac OS X 10.12.1 >> >> Thanks in advance and best regards, >> James >> >> >> >> >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From se.raschka at gmail.com Tue Dec 20 14:00:17 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 20 Dec 2016 14:00:17 -0500 Subject: [scikit-learn] combining arrays of features to train an MLP In-Reply-To: References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com> Message-ID: <8F3A4E7B-3A92-4B04-BDCC-994C4C9B2C78@gmail.com> Hi, Thomas, I haven?t looked what RandomizedLasso does exactly, but like you said, it is probably not ideal for combining it with an MLP. What In terms of regularization, I was more thinking of the L1 and L2 for the hidden layers, or dropout. However, given such a small sample size (and the small the sample/feature ratio), I think there are way too many (hyper/)parameters to fit in an MLP to get good results. I think you could be better off with a kernel SVM (if linear models don?t work well) or ensemble learning. Best, Sebastian > On Dec 19, 2016, at 6:51 PM, Thomas Evangelidis wrote: > > Thank you, these articles discuss about ML application of the types of fingerprints I working with! I will read them thoroughly to get some hints. > > In the meantime I tried to eliminate some features using RandomizedLasso and the performance escalated from R=0.067 using all 615 features to R=0.524 using only the 15 top ranked features. Naive question: does it make sense to use the RandomizedLasso to select the good features in order to train a MLP? I had the impression that RandomizedLasso uses multi-variate linear regression to fit the observed values to the experimental and rank the features. > > Another question: this dataset consists of 31 observations. The Pearson's R values that I reported above were calculated using cross-validation. Could someone claim that they are inaccurate because the number of features used for training the MLP is much larger than the number of observations? > > > On 19 December 2016 at 23:42, Sebastian Raschka wrote: > Oh, sorry, I just noticed that I was in the wrong thread ? meant answer a different Thomas :P. > > Regarding the fingerprints; scikit-learn?s estimators expect feature vectors as samples, so you can?t have a 3D array ? e.g., think of image classification: here you also enroll the n_pixels times m_pixels array into 1D arrays. > > The low performance can have mutliple issues. In case dimensionality is an issue, I?d maybe try stronger regularization first, or feature selection. > If you are working with molecular structures, and you have enough of them, maybe also consider alternative feature representations, e.g,. learning from the graphs directly: > > http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf > http://pubs.acs.org/doi/abs/10.1021/ci400187y > > Best, > Sebastian > > > > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis wrote: > > > > this means that both are feasible? > > > > On 19 December 2016 at 18:17, Sebastian Raschka wrote: > > Thanks, Thomas, that makes sense! Will submit a PR then to update the docstring. > > > > Best, > > Sebastian > > > > > > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis wrote: > > > > > > ?? > > > Greetings, > > > > > > My dataset consists of objects which are characterised by their structural features which are encoded into a so called "fingerprint" form. There are several different types of fingerprints, each one encapsulating different type of information. I want to combine two specific types of fingerprints to train a MLP regressor. 
The first fingerprint consists of a 2048 bit array of the form: > > > > > > ?FP?1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > > > > > The second is a 60 float number array of the form: > > > > > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, 1.31473857, > > > -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > > > ... > > > 0. , 0. , 5.89652792, 0. , 0. ]) > > > > > > At first I tried to fuse them into a single 1D array of 2048+60 columns but the predictions of the MLP were worse than the 2 different MLP models trained from one of the 2 fingerprint types individually. My question: is there a more effective way to combine the 2 fingerprints in order to indicate that they represent different type of information? > > > > > > To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd row 60 elements) but sklearn complained: > > > > > > ? mlp.fit(x_train,y_train) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit > > > return self._fit(X, y, incremental=False) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit > > > X, y = self._validate_input(X, y, incremental) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input > > > multi_output=True, y_numeric=True) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y > > > ensure_min_features, warn_on_dtype, estimator) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array > > > array = array.astype(np.float64) > > > ValueError: setting an array element with a sequence. > > > ? > > > > > > ?Then I tried to ?create for each object of the dataset a 2D array of size 2x2048, by adding 1998 zeros in the second row in order both rows to be of equal size. However sklearn complained again: > > > > > > > > > mlp.fit(x_train,y_train) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit > > > return self._fit(X, y, incremental=False) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit > > > X, y = self._validate_input(X, y, incremental) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input > > > multi_output=True, y_numeric=True) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y > > > ensure_min_features, warn_on_dtype, estimator) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array > > > % (array.ndim, estimator_name)) > > > ValueError: Found array with dim 3. Estimator expected <= 2. > > > > > > > > > In another case of fingerprints, lets name them FP3 and FP4, I observed that the MLP regressor created using FP3 yields better results when trained and evaluated using logarithmically transformed experimental values (the values in y_train and y_test 1D arrays), while the MLP regressor created using FP4 yielded better results using the original experimental values. 
So my second question is: when combining both FP3 and FP4 into a single array is there any way to designate to the MLP that the features that correspond to FP3 must reproduce the logarithmic transform of the experimental values while the features of FP4 the original untransformed experimental values? > > > > > > > > > I would greatly appreciate any advice on any of my 2 queries. > > > Thomas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > ====================================================================== > > > Thomas Evangelidis > > > Research Specialist > > > CEITEC - Central European Institute of Technology > > > Masaryk University > > > Kamenice 5/A35/1S081, > > > 62500 Brno, Czech Republic > > > > > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Wed Dec 21 05:03:42 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 21 Dec 2016 21:03:42 +1100 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI Message-ID: At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've created a bookmarklet which, when viewing a pull request page for which the CircleCI build has finished, will identify the circle build number and open a new tab with the changed documentation files corresponding to that PR. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Dec 21 05:03:59 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 21 Dec 2016 21:03:59 +1100 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: Message-ID: I hope it's useful to someone else. 
On 21 December 2016 at 21:03, Joel Nothman wrote: > At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've > created a bookmarklet which, when viewing a pull request page for which the > CircleCI build has finished, will identify the circle build number and open > a new tab with the changed documentation files corresponding to that PR. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Wed Dec 21 15:06:55 2016 From: nfliu at uw.edu (Nelson Liu) Date: Wed, 21 Dec 2016 20:06:55 +0000 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: Message-ID: This is great, thanks for sharing Joel! On Wed, Dec 21, 2016 at 12:08 AM Joel Nothman wrote: > I hope it's useful to someone else. > > On 21 December 2016 at 21:03, Joel Nothman wrote: > > At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've > created a bookmarklet which, when viewing a pull request page for which the > CircleCI build has finished, will identify the circle build number and open > a new tab with the changed documentation files corresponding to that PR. > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Dec 21 17:33:38 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 21 Dec 2016 23:33:38 +0100 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: Message-ID: <20161221223338.GA334300@phare.normalesup.org> It's super neat. It's a pity that I don't see a way of integrating it to the github interface. Ga?l On Wed, Dec 21, 2016 at 09:03:59PM +1100, Joel Nothman wrote: > I hope it's useful to someone else. > On 21 December 2016 at 21:03, Joel Nothman wrote: > At?https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've > created a bookmarklet which, when viewing a pull request page for which the > CircleCI build has finished, will identify the circle build number and open > a new tab with the changed documentation files corresponding to that PR. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From joel.nothman at gmail.com Wed Dec 21 19:48:00 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 22 Dec 2016 11:48:00 +1100 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: <20161221223338.GA334300@phare.normalesup.org> References: <20161221223338.GA334300@phare.normalesup.org> Message-ID: Well, you can as a browser extension. I just haven't bothered to investigate that technology when there's so much code to review and write. On 22 December 2016 at 09:33, Gael Varoquaux wrote: > It's super neat. It's a pity that I don't see a way of integrating it to > the github interface. > > Ga?l > > On Wed, Dec 21, 2016 at 09:03:59PM +1100, Joel Nothman wrote: > > I hope it's useful to someone else. 
> > > On 21 December 2016 at 21:03, Joel Nothman > wrote: > > > At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 > I've > > created a bookmarklet which, when viewing a pull request page for > which the > > CircleCI build has finished, will identify the circle build number > and open > > a new tab with the changed documentation files corresponding to that > PR. > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Mon Dec 26 13:28:54 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Mon, 26 Dec 2016 23:58:54 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library Message-ID: Dear All, Greetings! I need some urgent guidance and help from you all in model scoring. What I mean by model scoring is around the following steps: 1. I have trained a Random Classifier model using scikit-learn (RandomForestClassifier library) 2. Then I have generated the True Positive and False Positive predictions on my test data set using predict_proba method (I have splitted my data into training and test samples in 80:20 ratio) 3. Finally, I have dumped the model into a pkl file. 4. Next in another instance, I have loaded the .pkl file 5. I have initiated job_lib.predict_proba method for predicting the True Positive and False positives on a different sample. I am terming this step as scoring whether I am predicting without retraining the model My question is when I generate the True Positive Rate on the test data set (as part of model training approach), the rate which I am getting is 10 ? 12%. But when I do the scoring (using the steps mentioned above), my True Positive Rate is shooting high upto 80%. Although, I am happy to get a very high TPR but my question is whether getting such a high TPR during the scoring phase is an expected outcome? In other words, whether achieving a high TPR through joblib is an accepted outcome vis-?-vis getting the TPR on training / test data set. Your views on the above ask will be really helpful as I am very confused whether to consider scoring the model using joblib. Otherwise is there any other alternative to joblib, which can help me to do scoring without retraining the model. Please let me know as per your earliest convenience as am a bit pressed Thanks for your help in advance! Cheers, Debu -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Dec 26 15:26:28 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 27 Dec 2016 07:26:28 +1100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: Message-ID: Hi Debu, Your post is terminologically confusing, so I'm not sure I've understood your problem. Where is the "different sample" used for scoring coming from? Is it possible it is more related to the training data than the test sample? 
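To make the distinction concrete, here is a minimal sketch of the dump / load / score workflow with a properly held-out test set -- the synthetic data, file name and parameter values below are placeholders, not a description of your actual setup:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib  # joblib as bundled with scikit-learn 0.18
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: binary labels, 80:20 split).
X, y = make_classification(n_samples=10000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
joblib.dump(clf, 'rf_model.pkl')

# Later, in another session: reload the model and score it only on data it never saw.
clf_loaded = joblib.load('rf_model.pkl')
proba_test = clf_loaded.predict_proba(X_test)[:, 1]
print('fraction of test samples above 0.5:', (proba_test > 0.5).mean())

# Scoring the reloaded model on X_train (or on any set that overlaps the training
# data) will look far better than this, but it says little about generalisation.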
Joel

On 27 December 2016 at 05:28, Debabrata Ghosh wrote:
> Dear All,
>
> Greetings!
>
> I need some urgent guidance and help from you all in model scoring. What I
> mean by model scoring is around the following steps:
>
> 1. I have trained a Random Classifier model using scikit-learn
> (RandomForestClassifier library)
> 2. Then I have generated the True Positive and False Positive predictions
> on my test data set using predict_proba method (I have splitted my data
> into training and test samples in 80:20 ratio)
> 3. Finally, I have dumped the model into a pkl file.
> 4. Next in another instance, I have loaded the .pkl file
> 5. I have initiated job_lib.predict_proba method for predicting the True
> Positive and False positives on a different sample. I am terming this step
> as scoring whether I am predicting without retraining the model
>
> My question is when I generate the True Positive Rate on the test data set
> (as part of model training approach), the rate which I am getting is
> 10 - 12%. But when I do the scoring (using the steps mentioned above), my
> True Positive Rate is shooting high upto 80%. Although, I am happy to get a
> very high TPR but my question is whether getting such a high TPR during the
> scoring phase is an expected outcome? In other words, whether achieving a
> high TPR through joblib is an accepted outcome vis-a-vis getting the TPR on
> training / test data set.
>
> Your views on the above ask will be really helpful as I am very confused
> whether to consider scoring the model using joblib. Otherwise is there any
> other alternative to joblib, which can help me to do scoring without
> retraining the model. Please let me know as per your earliest convenience
> as am a bit pressed
>
> Thanks for your help in advance!
>
> Cheers,
> Debu
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mailfordebu at gmail.com Tue Dec 27 00:26:22 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Tue, 27 Dec 2016 10:56:22 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To: 
References: 
Message-ID: 

Hi Joel,

Thanks for your quick feedback - I certainly understand what you mean.
Please allow me to explain one more time through the sequence of steps
corresponding to the approach I followed:

1. I considered a dataset containing 600 K (0.6 million) records for
training my model using scikit-learn's Random Forest Classifier library.

2. I did a training and test sample split on the 600 K - forming a 480 K
training dataset and a 120 K test dataset (80:20 split).

3. I trained scikit-learn's Random Forest Classifier model on the 480 K
(80% split) training sample.

4. Then I ran prediction (the predict_proba method of scikit-learn's RF
library) on the 120 K test sample.

5. I got a prediction result with a True Positive Rate (TPR) of 10-12% on
probability thresholds above 0.5.

6. I saved the above Random Forest Classifier model using scikit-learn's
joblib library (dump method) in the form of a pickle file.

7. I reloaded the model in a different Python instance from the pickle file
mentioned above and did my scoring, i.e., used the joblib library's load
method and then ran prediction (predict_proba method) on the entire set of
my original 600 K records.

8. Now when I am running (scoring) my model using joblib.predict_proba on
the entire set of the original data (600 K), I am getting a True Positive
Rate of around 80%.

9. I did some further analysis and figured out that during the training
process, when the model was predicting on the test sample of 120 K, it
could only predict 10-12% of the 120 K data beyond a probability threshold
of 0.5. When I now try to score my model on the entire set of 600 K
records, it appears that the model is remembering some of its past
behaviour and data and accordingly throwing an 80% True Positive Rate.

10. When I tried to score the model using joblib.predict_proba on a
completely disjoint dataset from the one used for training (i.e., no
overlap between training and scoring data), then it gives me the right
True Positive Rate (in the range of 10-12%).

*Here lies my question once again:* Should I be using 2 different input
datasets (completely exclusive / disjoint) for training and scoring the
models? In case the input datasets for scoring and training overlap, I get
incorrect results. Will that be a fair assumption?

Another question - is there an alternate model scoring library (apart from
joblib, the one I am using)?

Thanks once again for your feedback in advance!

Cheers,
Debu

On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman wrote:
> Hi Debu,
>
> Your post is terminologically confusing, so I'm not sure I've understood
> your problem. Where is the "different sample" used for scoring coming
> from? Is it possible it is more related to the training data than the
> test sample?
>
> Joel
>
> On 27 December 2016 at 05:28, Debabrata Ghosh wrote:
>
>> Dear All,
>>
>> Greetings!
>>
>> I need some urgent guidance and help from you all in model scoring. What
>> I mean by model scoring is around the following steps:
>>
>> 1. I have trained a Random Classifier model using scikit-learn
>> (RandomForestClassifier library)
>> 2. Then I have generated the True Positive and False Positive
>> predictions on my test data set using predict_proba method (I have
>> splitted my data into training and test samples in 80:20 ratio)
>> 3. Finally, I have dumped the model into a pkl file.
>> 4. Next in another instance, I have loaded the .pkl file
>> 5. I have initiated job_lib.predict_proba method for predicting the True
>> Positive and False positives on a different sample. I am terming this
>> step as scoring whether I am predicting without retraining the model
>>
>> My question is when I generate the True Positive Rate on the test data
>> set (as part of model training approach), the rate which I am getting is
>> 10 - 12%. But when I do the scoring (using the steps mentioned above), my
>> True Positive Rate is shooting high upto 80%. Although, I am happy to get
>> a very high TPR but my question is whether getting such a high TPR during
>> the scoring phase is an expected outcome? In other words, whether
>> achieving a high TPR through joblib is an accepted outcome vis-a-vis
>> getting the TPR on training / test data set.
>>
>> Your views on the above ask will be really helpful as I am very confused
>> whether to consider scoring the model using joblib. Otherwise is there
>> any other alternative to joblib, which can help me to do scoring without
>> retraining the model. Please let me know as per your earliest convenience
>> as am a bit pressed
>>
>> Thanks for your help in advance!
>> >> >> >> Cheers, >> >> Debu >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Tue Dec 27 02:18:42 2016 From: ahowe42 at gmail.com (Andrew Howe) Date: Tue, 27 Dec 2016 10:18:42 +0300 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: Message-ID: Hi Debu "Should I be using 2 different input datasets (completely exclusive / disjoint) for training and scoring the models ?" Yes - this is the reason for partitioning the data into training / testing sets. However, I can't imagine that it's the cause of your odd results. What is the total classification result in both training & testing (not just TPs)? Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD www.andrewhowe.com http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Tue, Dec 27, 2016 at 8:26 AM, Debabrata Ghosh wrote: > Hi Joel, > > Thanks for your quick feedback ? I certainly understand > what you mean and please allow me to explain one more time through a > sequence of steps corresponding to the approach I followed: > > > > 1. I considered a dataset containing 600 K (0.6 million) records for > training my model using scikit learn?s Random Forest Classifier library > > > > 1. I did a training and test sample split on 600 k ? forming 480 K > training dataset and 120 K test dataset (80:20 split) > > > > 1. I trained scikit learn?s Random Forest Classifier model on the 480 > K (80% split) training sample > > > > 1. Then I ran prediction (predict_proba method of scikit learn?s RF > library) on the 120 K test sample > > > > 1. I got a prediction result with True Positive Rate (TPR) as 10-12 % > on probability thresholds above 0.5 > > > > 1. I saved the above Random Forest Classifier model using scikit > learn?s joblib library (dump method) in the form of a pickle file > > > > 1. I reloaded the model in a different python instance from the pickle > file mentioned above and did my scoring , i.e., used joblib library load > method and then instantiated prediction (predict_proba method) on the > entire set of my original 600 K records > > > > 1. Now when I am running (scoring) my model using joblib.predict_proba > on the entire set of original data (600 K), I am getting a True Positive > rate of around 80%. > > > > 1. I did some further analysis and figured out that during the > training process, when the model was predicting on the test sample of 120K > it could only predict 10-12% of 120K data beyond a probability threshold of > 0.5. When I am now trying to score my model on the entire set of 600 K > records, it appears that the model is remembering some of it?s past > behavior and data and accordingly throwing 80% True positive rate > > > > 1. When I tried to score the model using joblib.predict_proba on a > completely disjoint dataset from the one used for training (i.e., no > overlap between training and scoring data) then it?s giving me the right > True Positive Rate (in the range of 10 ? 
12%) > > *Here lies my question once again:* Should I be using 2 > different input datasets (completely exclusive / disjoint) for training and > scoring the models ? In case the input datasets for scoring and training > overlaps then I get incorrect results. Will that be a fair assumption ? > > Another question ? is there an alternate model scoring library > (apart from joblib, the one I am using) ? > > > Thanks once again for your feedback in advance ! > > > Cheers, > > > Debu > > On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman > wrote: > >> Hi Debu, >> >> Your post is terminologically confusing, so I'm not sure I've understood >> your problem. Where is the "different sample" used for scoring coming from? >> Is it possible it is more related to the training data than the test sample? >> >> Joel >> >> On 27 December 2016 at 05:28, Debabrata Ghosh >> wrote: >> >>> Dear All, >>> >>> Greetings! >>> >>> I need some urgent guidance and help >>> from you all in model scoring. What I mean by model scoring is around the >>> following steps: >>> >>> >>> >>> 1. I have trained a Random Classifier model using scikit-learn >>> (RandomForestClassifier library) >>> 2. Then I have generated the True Positive and False Positive >>> predictions on my test data set using predict_proba method (I have splitted >>> my data into training and test samples in 80:20 ratio) >>> 3. Finally, I have dumped the model into a pkl file. >>> 4. Next in another instance, I have loaded the .pkl file >>> 5. I have initiated job_lib.predict_proba method for predicting the >>> True Positive and False positives on a different sample. I am terming this >>> step as scoring whether I am predicting without retraining the model >>> >>> My question is when I generate the True Positive Rate >>> on the test data set (as part of model training approach), the rate which I >>> am getting is 10 ? 12%. But when I do the scoring (using the steps >>> mentioned above), my True Positive Rate is shooting high upto 80%. >>> Although, I am happy to get a very high TPR but my question is whether >>> getting such a high TPR during the scoring phase is an expected outcome? In >>> other words, whether achieving a high TPR through joblib is an accepted >>> outcome vis-?-vis getting the TPR on training / test data set. >>> >>> Your views on the above ask will be really helpful as I >>> am very confused whether to consider scoring the model using joblib. >>> Otherwise is there any other alternative to joblib, which can help me to do >>> scoring without retraining the model. Please let me know as per your >>> earliest convenience as am a bit pressed >>> >>> >>> >>> Thanks for your help in advance! >>> >>> >>> >>> Cheers, >>> >>> Debu >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rth.yurchak at gmail.com Tue Dec 27 04:51:39 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 27 Dec 2016 10:51:39 +0100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: Message-ID: <586239AB.80500@gmail.com> Hi Debu, On 27/12/16 08:18, Andrew Howe wrote: > 5. I got a prediction result with True Positive Rate (TPR) as 10-12 > % on probability thresholds above 0.5 Getting a high True Positive Rate (recall) is not a sufficient condition for a well behaved model. Though 0.1 recall is still pretty bad. You could look at the precision at the same time (or consider, for instance, the F1 score). > 7. I reloaded the model in a different python instance from the > pickle file mentioned above and did my scoring , i.e., used > joblib library load method and then instantiated prediction > (predict_proba method) on the entire set of my original 600 K > records > Another question ? is there an alternate model scoring > library (apart from joblib, the one I am using) ? Joblib is not a scoring library; once you load a model from disk with joblib you should get ~ the same RandomForestClassifier estimator object as before saving it. > 8. Now when I am running (scoring) my model using > joblib.predict_proba on the entire set of original data (600 K), > I am getting a True Positive rate of around 80%. That sounds normal, considering what you are doing. Your entire set consists of 80% of training set (for which the recall, I imagine, would be close to 1.0) and 20 % test set (with a recall of 0.1), so on average you would get a recall close to 0.8 for the complete set. Unless I missed something. > 9. I did some further analysis and figured out that during the > training process, when the model was predicting on the test > sample of 120K it could only predict 10-12% of 120K data beyond > a probability threshold of 0.5. When I am now trying to score my > model on the entire set of 600 K records, it appears that the > model is remembering some of it?s past behavior and data and > accordingly throwing 80% True positive rate It feels like your RandomForestClassifier is not properly tuned. A recall of 0.1 on the test set is quite low. It could be worth trying to tune it better (cf. https://stackoverflow.com/a/36109706 ), using some other metric than the recall to evaluate the performance. Roman From joel.nothman at gmail.com Tue Dec 27 05:52:30 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 27 Dec 2016 21:52:30 +1100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: <586239AB.80500@gmail.com> References: <586239AB.80500@gmail.com> Message-ID: Your model is overfit to the training data. Not to say that it's necessarily possible to get a better fit. The default settings for trees lean towards a tight fit, so you might modify their parameters to increase regularisation. Still, you should not expect that evaluating a model's performance on its training data will be indicative of its general performance. This is why we use held-out test sets and cross-validation. On 27 December 2016 at 20:51, Roman Yurchak wrote: > Hi Debu, > > On 27/12/16 08:18, Andrew Howe wrote: > > 5. I got a prediction result with True Positive Rate (TPR) as 10-12 > > % on probability thresholds above 0.5 > > Getting a high True Positive Rate (recall) is not a sufficient condition > for a well behaved model. Though 0.1 recall is still pretty bad. 
You > could look at the precision at the same time (or consider, for instance, > the F1 score). > > > 7. I reloaded the model in a different python instance from the > > pickle file mentioned above and did my scoring , i.e., used > > joblib library load method and then instantiated prediction > > (predict_proba method) on the entire set of my original 600 K > > records > > Another question ? is there an alternate model scoring > > library (apart from joblib, the one I am using) ? > > Joblib is not a scoring library; once you load a model from disk with > joblib you should get ~ the same RandomForestClassifier estimator object > as before saving it. > > > 8. Now when I am running (scoring) my model using > > joblib.predict_proba on the entire set of original data (600 K), > > I am getting a True Positive rate of around 80%. > > That sounds normal, considering what you are doing. Your entire set > consists of 80% of training set (for which the recall, I imagine, would > be close to 1.0) and 20 % test set (with a recall of 0.1), so on > average you would get a recall close to 0.8 for the complete set. Unless > I missed something. > > > > 9. I did some further analysis and figured out that during the > > training process, when the model was predicting on the test > > sample of 120K it could only predict 10-12% of 120K data beyond > > a probability threshold of 0.5. When I am now trying to score my > > model on the entire set of 600 K records, it appears that the > > model is remembering some of it?s past behavior and data and > > accordingly throwing 80% True positive rate > > It feels like your RandomForestClassifier is not properly tuned. A > recall of 0.1 on the test set is quite low. It could be worth trying to > tune it better (cf. https://stackoverflow.com/a/36109706 ), using some > other metric than the recall to evaluate the performance. > > > Roman > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Tue Dec 27 12:17:05 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Tue, 27 Dec 2016 22:47:05 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Dear Joel, Andrew and Roman, Thank you very much for your individual feedback ! It's very helpful indeed ! A few more points related to my model execution: 1. By the term "scoring" I meant the process of executing the model once again without retraining it. So , for training the model I used RandomForestClassifer library and for my scoring (execution without retraining) I have used joblib.dump and joblib.load 2. I have used the parameter n_estimator = 5000 while training my model. Besides it , I have used n_jobs = -1 and haven't used any other parameter 3. For my "scoring" activity (executing the model without retraining it) is there an alternate approach to joblib library ? 4. When I execute my scoring job (joblib method) on a dataset , which is completely different to my training dataset then I get similar True Positive Rate and False Positive Rate as of training 5. However, when I execute my scoring job on the same dataset used for training my model then I get very high TPR and FPR. Is there mechanism through which I can visualise the trees created by my RandomForestClassifer algorithm ? 
While I dumped the model using joblib.dump , there are a bunch of .npy files created. Will those contain the trees ? Thanks in advance ! Cheers, Debu On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman wrote: > Your model is overfit to the training data. Not to say that it's > necessarily possible to get a better fit. The default settings for trees > lean towards a tight fit, so you might modify their parameters to increase > regularisation. Still, you should not expect that evaluating a model's > performance on its training data will be indicative of its general > performance. This is why we use held-out test sets and cross-validation. > > On 27 December 2016 at 20:51, Roman Yurchak wrote: > >> Hi Debu, >> >> On 27/12/16 08:18, Andrew Howe wrote: >> > 5. I got a prediction result with True Positive Rate (TPR) as 10-12 >> > % on probability thresholds above 0.5 >> >> Getting a high True Positive Rate (recall) is not a sufficient condition >> for a well behaved model. Though 0.1 recall is still pretty bad. You >> could look at the precision at the same time (or consider, for instance, >> the F1 score). >> >> > 7. I reloaded the model in a different python instance from the >> > pickle file mentioned above and did my scoring , i.e., used >> > joblib library load method and then instantiated prediction >> > (predict_proba method) on the entire set of my original 600 K >> > records >> > Another question ? is there an alternate model scoring >> > library (apart from joblib, the one I am using) ? >> >> Joblib is not a scoring library; once you load a model from disk with >> joblib you should get ~ the same RandomForestClassifier estimator object >> as before saving it. >> >> > 8. Now when I am running (scoring) my model using >> > joblib.predict_proba on the entire set of original data (600 K), >> > I am getting a True Positive rate of around 80%. >> >> That sounds normal, considering what you are doing. Your entire set >> consists of 80% of training set (for which the recall, I imagine, would >> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >> average you would get a recall close to 0.8 for the complete set. Unless >> I missed something. >> >> >> > 9. I did some further analysis and figured out that during the >> > training process, when the model was predicting on the test >> > sample of 120K it could only predict 10-12% of 120K data beyond >> > a probability threshold of 0.5. When I am now trying to score my >> > model on the entire set of 600 K records, it appears that the >> > model is remembering some of it?s past behavior and data and >> > accordingly throwing 80% True positive rate >> >> It feels like your RandomForestClassifier is not properly tuned. A >> recall of 0.1 on the test set is quite low. It could be worth trying to >> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some >> other metric than the recall to evaluate the performance. >> >> >> Roman >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From g.lemaitre58 at gmail.com Tue Dec 27 12:48:29 2016 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 27 Dec 2016 18:48:29 +0100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: On 27 December 2016 at 18:17, Debabrata Ghosh wrote: > Dear Joel, Andrew and Roman, > Thank you very much > for your individual feedback ! It's very helpful indeed ! A few more points > related to my model execution: > > 1. By the term "scoring" I meant the process of executing the model once > again without retraining it. So , for training the model I used > RandomForestClassifer library and for my scoring (execution without > retraining) I have used joblib.dump and joblib.load > Go probably with the terms: training, validating, and testing. This is pretty much standard. Scoring is just the value of a metric given some data (training data, validation data, or testing data). > > 2. I have used the parameter n_estimator = 5000 while training my model. > Besides it , I have used n_jobs = -1 and haven't used any other parameter > You should probably check those other parameters and understand what are their effects. You should really check the link of Roman since GridSearchCV can help you to decide how to fix the parameters. http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV Additionally, 5000 trees seems a lot to me. > > 3. For my "scoring" activity (executing the model without retraining it) > is there an alternate approach to joblib library ? > Joblib only store data. There is not link with scoring (Check Roman answer) > > 4. When I execute my scoring job (joblib method) on a dataset , which is > completely different to my training dataset then I get similar True > Positive Rate and False Positive Rate as of training > It is what you should get. > > 5. However, when I execute my scoring job on the same dataset used for > training my model then I get very high TPR and FPR. > You are testing on some data which you used while training. Probably, one of the first rule is to not do that. If you want to evaluate in some way your classifier, have a separate set (test set) and only test on that one. As previously mentioned by Roman, 80% of your data are already known by the RandomForestClassifier and will be perfectly classified. > > Is there mechanism > through which I can visualise the trees created by my RandomForestClassifer > algorithm ? While I dumped the model using joblib.dump , there are a bunch > of .npy files created. Will those contain the trees ? > You can visualize the trees with sklearn.tree.export_graphviz: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html The bunch of npy are the data needed to load the RandomForestClassifier which you previously dumped. > > Thanks in advance ! > > Cheers, > > Debu > > On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman > wrote: > >> Your model is overfit to the training data. Not to say that it's >> necessarily possible to get a better fit. The default settings for trees >> lean towards a tight fit, so you might modify their parameters to increase >> regularisation. Still, you should not expect that evaluating a model's >> performance on its training data will be indicative of its general >> performance. This is why we use held-out test sets and cross-validation. 
>> >> On 27 December 2016 at 20:51, Roman Yurchak >> wrote: >> >>> Hi Debu, >>> >>> On 27/12/16 08:18, Andrew Howe wrote: >>> > 5. I got a prediction result with True Positive Rate (TPR) as >>> 10-12 >>> > % on probability thresholds above 0.5 >>> >>> Getting a high True Positive Rate (recall) is not a sufficient condition >>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>> could look at the precision at the same time (or consider, for instance, >>> the F1 score). >>> >>> > 7. I reloaded the model in a different python instance from the >>> > pickle file mentioned above and did my scoring , i.e., used >>> > joblib library load method and then instantiated prediction >>> > (predict_proba method) on the entire set of my original 600 K >>> > records >>> > Another question ? is there an alternate model scoring >>> > library (apart from joblib, the one I am using) ? >>> >>> Joblib is not a scoring library; once you load a model from disk with >>> joblib you should get ~ the same RandomForestClassifier estimator object >>> as before saving it. >>> >>> > 8. Now when I am running (scoring) my model using >>> > joblib.predict_proba on the entire set of original data (600 >>> K), >>> > I am getting a True Positive rate of around 80%. >>> >>> That sounds normal, considering what you are doing. Your entire set >>> consists of 80% of training set (for which the recall, I imagine, would >>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>> average you would get a recall close to 0.8 for the complete set. Unless >>> I missed something. >>> >>> >>> > 9. I did some further analysis and figured out that during the >>> > training process, when the model was predicting on the test >>> > sample of 120K it could only predict 10-12% of 120K data beyond >>> > a probability threshold of 0.5. When I am now trying to score >>> my >>> > model on the entire set of 600 K records, it appears that the >>> > model is remembering some of it?s past behavior and data and >>> > accordingly throwing 80% True positive rate >>> >>> It feels like your RandomForestClassifier is not properly tuned. A >>> recall of 0.1 on the test set is quite low. It could be worth trying to >>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some >>> other metric than the recall to evaluate the performance. >>> >>> >>> Roman >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Tue Dec 27 13:38:29 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Wed, 28 Dec 2016 00:08:29 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Thanks Guillaume for your quick feedback ! Appreciate it a lot. I will definitely try out the links you have given. Another quick one please. 
My objective is to execute the model without retraining it. Let me get you an example here to elaborate this - I train my model on a huge set of data (historic 6 months worth of data) and finalise my model. Now going forward I need to run my model against smaller set of data (daily data) and for that I wouldn't need to retrain my model daily. Given the above scenario, I wanted to confirm once more whether after training the model if I use joblib.dump and then while executing the model on daily basis, if I use joblib.load then is this a good approach. I am using joblib.dump(clf, 'model.pkl') and for loading , I am using joblib.load('model.pkl). I amn't leveraging any of the *.npy files generated in the folder. Now, as you mentioned that joblib is a mechanism to save the data but my objective is not to load the data used during the model training but only the algorithm so that I can run the model on a fresh set of data after loading data. And indeed my model is running fine after I execute the joblib.load ('model.pkl) command but I wanted to confirm what it's doing internally. Thanks in advance ! Cheers, Debu On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre wrote: > On 27 December 2016 at 18:17, Debabrata Ghosh > wrote: > >> Dear Joel, Andrew and Roman, >> Thank you very much >> for your individual feedback ! It's very helpful indeed ! A few more points >> related to my model execution: >> >> 1. By the term "scoring" I meant the process of executing the model once >> again without retraining it. So , for training the model I used >> RandomForestClassifer library and for my scoring (execution without >> retraining) I have used joblib.dump and joblib.load >> > > Go probably with the terms: training, validating, and testing. > This is pretty much standard. Scoring is just the value of a > metric given some data (training data, validation data, or > testing data). > > >> >> 2. I have used the parameter n_estimator = 5000 while training my model. >> Besides it , I have used n_jobs = -1 and haven't used any other parameter >> > > You should probably check those other parameters and understand > what are their effects. You should really check the link of Roman > since GridSearchCV can help you to decide how to fix the parameters. > http://scikit-learn.org/stable/modules/generated/sklearn.model_selection. > GridSearchCV.html#sklearn.model_selection.GridSearchCV > Additionally, 5000 trees seems a lot to me. > > >> >> 3. For my "scoring" activity (executing the model without retraining it) >> is there an alternate approach to joblib library ? >> > > Joblib only store data. There is not link with scoring (Check Roman answer) > > >> >> 4. When I execute my scoring job (joblib method) on a dataset , which is >> completely different to my training dataset then I get similar True >> Positive Rate and False Positive Rate as of training >> > > It is what you should get. > > >> >> 5. However, when I execute my scoring job on the same dataset used for >> training my model then I get very high TPR and FPR. >> > > You are testing on some data which you used while training. Probably, > one of the first rule is to not do that. If you want to evaluate in some > way your classifier, have a separate set (test set) and only test on that > one. As previously mentioned by Roman, 80% of your data are already > known by the RandomForestClassifier and will be perfectly classified. > > >> >> Is there mechanism >> through which I can visualise the trees created by my RandomForestClassifer >> algorithm ? 
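A minimal sketch of the separation described above, assuming X and y hold the full 600 K rows of features and labels (the names are placeholders, not taken from the original code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, y_train)

# Precision/recall/F1 measured only on rows the forest has never seen.
print(classification_report(y_test, clf.predict(X_test)))
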
While I dumped the model using joblib.dump , there are a bunch >> of .npy files created. Will those contain the trees ? >> > > You can visualize the trees with sklearn.tree.export_graphviz: > http://scikit-learn.org/stable/modules/generated/ > sklearn.tree.export_graphviz.html > > The bunch of npy are the data needed to load the RandomForestClassifier > which > you previously dumped. > > >> >> Thanks in advance ! >> >> Cheers, >> >> Debu >> >> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >> wrote: >> >>> Your model is overfit to the training data. Not to say that it's >>> necessarily possible to get a better fit. The default settings for trees >>> lean towards a tight fit, so you might modify their parameters to increase >>> regularisation. Still, you should not expect that evaluating a model's >>> performance on its training data will be indicative of its general >>> performance. This is why we use held-out test sets and cross-validation. >>> >>> On 27 December 2016 at 20:51, Roman Yurchak >>> wrote: >>> >>>> Hi Debu, >>>> >>>> On 27/12/16 08:18, Andrew Howe wrote: >>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>> 10-12 >>>> > % on probability thresholds above 0.5 >>>> >>>> Getting a high True Positive Rate (recall) is not a sufficient condition >>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>> could look at the precision at the same time (or consider, for instance, >>>> the F1 score). >>>> >>>> > 7. I reloaded the model in a different python instance from the >>>> > pickle file mentioned above and did my scoring , i.e., used >>>> > joblib library load method and then instantiated prediction >>>> > (predict_proba method) on the entire set of my original 600 K >>>> > records >>>> > Another question ? is there an alternate model scoring >>>> > library (apart from joblib, the one I am using) ? >>>> >>>> Joblib is not a scoring library; once you load a model from disk with >>>> joblib you should get ~ the same RandomForestClassifier estimator object >>>> as before saving it. >>>> >>>> > 8. Now when I am running (scoring) my model using >>>> > joblib.predict_proba on the entire set of original data (600 >>>> K), >>>> > I am getting a True Positive rate of around 80%. >>>> >>>> That sounds normal, considering what you are doing. Your entire set >>>> consists of 80% of training set (for which the recall, I imagine, would >>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>> average you would get a recall close to 0.8 for the complete set. Unless >>>> I missed something. >>>> >>>> >>>> > 9. I did some further analysis and figured out that during the >>>> > training process, when the model was predicting on the test >>>> > sample of 120K it could only predict 10-12% of 120K data >>>> beyond >>>> > a probability threshold of 0.5. When I am now trying to score >>>> my >>>> > model on the entire set of 600 K records, it appears that the >>>> > model is remembering some of it?s past behavior and data and >>>> > accordingly throwing 80% True positive rate >>>> >>>> It feels like your RandomForestClassifier is not properly tuned. A >>>> recall of 0.1 on the test set is quite low. It could be worth trying to >>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some >>>> other metric than the recall to evaluate the performance. 
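For what it is worth, Roman's back-of-the-envelope mixture can be written out explicitly: if the positives are split in the same 80/20 proportion as the data, the recall on the full 600 K records comes out to roughly 0.8 * 1.0 + 0.2 * 0.1 = 0.82, which matches the ~80% true positive rate reported when scoring the complete set.
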
>>>> >>>> >>>> Roman >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r --- > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Dec 27 14:12:39 2016 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 27 Dec 2016 20:12:39 +0100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: On 27 December 2016 at 19:38, Debabrata Ghosh wrote: > Thanks Guillaume for your quick feedback ! Appreciate it a lot. > > I will definitely try out the links you have given. Another quick one > please. My objective is to execute the model without retraining it. Let me > get you an example here to elaborate this - I train my model on a huge set > of data (historic 6 months worth of data) and finalise my model. Now going > forward I need to run my model against smaller set of data (daily data) and > for that I wouldn't need to retrain my model daily. > So you just need to dump the model after training (which actually what you did). > > Given the above scenario, I wanted to confirm once more whether after > training the model if I use joblib.dump and then while executing the model > on daily basis, if I use joblib.load then is this a good approach. I am > using joblib.dump(clf, 'model.pkl') and for loading , I am using > joblib.load('model.pkl). I amn't leveraging any of the *.npy files > generated in the folder. > So, you need to train and dump the estimator. To predict with the dumped model, you need to load and use predict/predict_proba, etc. The npy file are the file associated to your model. In the case of a random forest you need to keep the parameter of each trees. Having 5000 trees, you should have many npy. The data themselves are not dumped. > > Now, as you mentioned that joblib is a mechanism to save the data but my > objective is not to load the data used during the model training but only > the algorithm so that I can run the model on a fresh set of data after > loading data. And indeed my model is running fine after I execute the > joblib.load ('model.pkl) command but I wanted to confirm what it's doing > internally. > > Thanks in advance ! > > Cheers, > > Debu > > On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < > g.lemaitre58 at gmail.com> wrote: > >> On 27 December 2016 at 18:17, Debabrata Ghosh >> wrote: >> >>> Dear Joel, Andrew and Roman, >>> Thank you very much >>> for your individual feedback ! It's very helpful indeed ! A few more points >>> related to my model execution: >>> >>> 1. By the term "scoring" I meant the process of executing the model once >>> again without retraining it. 
So , for training the model I used >>> RandomForestClassifer library and for my scoring (execution without >>> retraining) I have used joblib.dump and joblib.load >>> >> >> Go probably with the terms: training, validating, and testing. >> This is pretty much standard. Scoring is just the value of a >> metric given some data (training data, validation data, or >> testing data). >> >> >>> >>> 2. I have used the parameter n_estimator = 5000 while training my model. >>> Besides it , I have used n_jobs = -1 and haven't used any other parameter >>> >> >> You should probably check those other parameters and understand >> what are their effects. You should really check the link of Roman >> since GridSearchCV can help you to decide how to fix the parameters. >> http://scikit-learn.org/stable/modules/generated/sklearn. >> model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >> Additionally, 5000 trees seems a lot to me. >> >> >>> >>> 3. For my "scoring" activity (executing the model without retraining it) >>> is there an alternate approach to joblib library ? >>> >> >> Joblib only store data. There is not link with scoring (Check Roman >> answer) >> >> >>> >>> 4. When I execute my scoring job (joblib method) on a dataset , which is >>> completely different to my training dataset then I get similar True >>> Positive Rate and False Positive Rate as of training >>> >> >> It is what you should get. >> >> >>> >>> 5. However, when I execute my scoring job on the same dataset used for >>> training my model then I get very high TPR and FPR. >>> >> >> You are testing on some data which you used while training. Probably, >> one of the first rule is to not do that. If you want to evaluate in some >> way your classifier, have a separate set (test set) and only test on that >> one. As previously mentioned by Roman, 80% of your data are already >> known by the RandomForestClassifier and will be perfectly classified. >> >> >>> >>> Is there mechanism >>> through which I can visualise the trees created by my RandomForestClassifer >>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>> of .npy files created. Will those contain the trees ? >>> >> >> You can visualize the trees with sklearn.tree.export_graphviz: >> http://scikit-learn.org/stable/modules/generated/sklearn. >> tree.export_graphviz.html >> >> The bunch of npy are the data needed to load the RandomForestClassifier >> which >> you previously dumped. >> >> >>> >>> Thanks in advance ! >>> >>> Cheers, >>> >>> Debu >>> >>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>> wrote: >>> >>>> Your model is overfit to the training data. Not to say that it's >>>> necessarily possible to get a better fit. The default settings for trees >>>> lean towards a tight fit, so you might modify their parameters to increase >>>> regularisation. Still, you should not expect that evaluating a model's >>>> performance on its training data will be indicative of its general >>>> performance. This is why we use held-out test sets and cross-validation. >>>> >>>> On 27 December 2016 at 20:51, Roman Yurchak >>>> wrote: >>>> >>>>> Hi Debu, >>>>> >>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>>> 10-12 >>>>> > % on probability thresholds above 0.5 >>>>> >>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>> condition >>>>> for a well behaved model. Though 0.1 recall is still pretty bad. 
You >>>>> could look at the precision at the same time (or consider, for >>>>> instance, >>>>> the F1 score). >>>>> >>>>> > 7. I reloaded the model in a different python instance from the >>>>> > pickle file mentioned above and did my scoring , i.e., used >>>>> > joblib library load method and then instantiated prediction >>>>> > (predict_proba method) on the entire set of my original 600 K >>>>> > records >>>>> > Another question ? is there an alternate model scoring >>>>> > library (apart from joblib, the one I am using) ? >>>>> >>>>> Joblib is not a scoring library; once you load a model from disk with >>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>> object >>>>> as before saving it. >>>>> >>>>> > 8. Now when I am running (scoring) my model using >>>>> > joblib.predict_proba on the entire set of original data (600 >>>>> K), >>>>> > I am getting a True Positive rate of around 80%. >>>>> >>>>> That sounds normal, considering what you are doing. Your entire set >>>>> consists of 80% of training set (for which the recall, I imagine, would >>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>> average you would get a recall close to 0.8 for the complete set. >>>>> Unless >>>>> I missed something. >>>>> >>>>> >>>>> > 9. I did some further analysis and figured out that during the >>>>> > training process, when the model was predicting on the test >>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>> beyond >>>>> > a probability threshold of 0.5. When I am now trying to >>>>> score my >>>>> > model on the entire set of 600 K records, it appears that the >>>>> > model is remembering some of it?s past behavior and data and >>>>> > accordingly throwing 80% True positive rate >>>>> >>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>> recall of 0.1 on the test set is quite low. It could be worth trying to >>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some >>>>> other metric than the recall to evaluate the performance. >>>>> >>>>> >>>>> Roman >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mailfordebu at gmail.com Tue Dec 27 13:47:58 2016 From: mailfordebu at gmail.com (mailfordebu at gmail.com) Date: Wed, 28 Dec 2016 00:17:58 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Hi Guillaume, And when I say that I have been able to run my models using joblib.load, I meant that I have run using joblib.load on a completely different dataset compared to the one I used for model training. And I got very similar result to joblib.load run as compared to the output from RandomForestClassifier run. Please advise on my last note accordingly. Cheers, Debu Sent from my iPhone > On 28-Dec-2016, at 12:08 AM, Debabrata Ghosh wrote: > > Guillaume From g.lemaitre58 at gmail.com Tue Dec 27 18:07:46 2016 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 28 Dec 2016 00:07:46 +0100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: I am not sure to understand your terminology. While calling joblib.load, you actually load the RandomForestClassifier. Therefore, calling predict from the estimator loaded with joblib is identical as using the RandomForestClassifier which you trained at the first place. I think that it would be much simpler if you can post snippet (short), to illustrate your thoughts and avoid confusions. Cheers, On 27 December 2016 at 19:47, wrote: > Hi Guillaume, > And when I say that I have been able to run my > models using joblib.load, I meant that I have run using joblib.load on a > completely different dataset compared to the one I used for model training. > And I got very similar result to joblib.load run as compared to the output > from RandomForestClassifier run. Please advise on my last note accordingly. > Cheers, > Debu > > Sent from my iPhone > > > On 28-Dec-2016, at 12:08 AM, Debabrata Ghosh > wrote: > > > > Guillaume > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gen.tang86 at gmail.com Wed Dec 28 12:05:15 2016 From: gen.tang86 at gmail.com (gen tang) Date: Thu, 29 Dec 2016 01:05:15 +0800 Subject: [scikit-learn] A math mistake in spectral_embedding Message-ID: Hi, everyone. I am quite new to this mail list. I think that I found a math mistake in spectral_embedding function. And I created a issue in github https://github.com/scikit-learn/scikit-learn/issues/8129 However, github can't show mathmatics equation, I send you pdf version(in attachment) of bug description by mail. Can anyone verify this math detail? Thanks a lot Cheers Gen -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: spectral_embedding_bug.pdf Type: application/pdf Size: 139825 bytes Desc: not available URL: From mailfordebu at gmail.com Wed Dec 28 14:25:16 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Thu, 29 Dec 2016 00:55:16 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Hi Guillaume, With respect to the following point you mentioned: You can visualize the trees with sklearn.tree.export_graphviz: http://scikit-learn.org/stable/modules/generated/sklearn.tre e.export_graphviz.html I couldn't find a direct method for exporting the RandomForestClassifier trees. Accordingly, I attempted for a workaround using the following code but still no success: clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) clf.fit(p_features_train,p_labels_train) for i, tree in enumerate(clf.estimators_): with open('tree_' + str(i) + '.dot', 'w') as dotfile: tree.export_graphviz(clf, dotfile) Would you please be able to help me with the piece of code which I need to execute for exporting the RandomForestClassifier trees. Cheers, Debu On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre wrote: > On 27 December 2016 at 18:17, Debabrata Ghosh > wrote: > >> Dear Joel, Andrew and Roman, >> Thank you very much >> for your individual feedback ! It's very helpful indeed ! A few more points >> related to my model execution: >> >> 1. By the term "scoring" I meant the process of executing the model once >> again without retraining it. So , for training the model I used >> RandomForestClassifer library and for my scoring (execution without >> retraining) I have used joblib.dump and joblib.load >> > > Go probably with the terms: training, validating, and testing. > This is pretty much standard. Scoring is just the value of a > metric given some data (training data, validation data, or > testing data). > > >> >> 2. I have used the parameter n_estimator = 5000 while training my model. >> Besides it , I have used n_jobs = -1 and haven't used any other parameter >> > > You should probably check those other parameters and understand > what are their effects. You should really check the link of Roman > since GridSearchCV can help you to decide how to fix the parameters. > http://scikit-learn.org/stable/modules/generated/sklearn.model_selection. > GridSearchCV.html#sklearn.model_selection.GridSearchCV > Additionally, 5000 trees seems a lot to me. > > >> >> 3. For my "scoring" activity (executing the model without retraining it) >> is there an alternate approach to joblib library ? >> > > Joblib only store data. There is not link with scoring (Check Roman answer) > > >> >> 4. When I execute my scoring job (joblib method) on a dataset , which is >> completely different to my training dataset then I get similar True >> Positive Rate and False Positive Rate as of training >> > > It is what you should get. > > >> >> 5. However, when I execute my scoring job on the same dataset used for >> training my model then I get very high TPR and FPR. >> > > You are testing on some data which you used while training. Probably, > one of the first rule is to not do that. If you want to evaluate in some > way your classifier, have a separate set (test set) and only test on that > one. As previously mentioned by Roman, 80% of your data are already > known by the RandomForestClassifier and will be perfectly classified. 
> > >> >> Is there mechanism >> through which I can visualise the trees created by my RandomForestClassifer >> algorithm ? While I dumped the model using joblib.dump , there are a bunch >> of .npy files created. Will those contain the trees ? >> > > You can visualize the trees with sklearn.tree.export_graphviz: > http://scikit-learn.org/stable/modules/generated/ > sklearn.tree.export_graphviz.html > > The bunch of npy are the data needed to load the RandomForestClassifier > which > you previously dumped. > > >> >> Thanks in advance ! >> >> Cheers, >> >> Debu >> >> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >> wrote: >> >>> Your model is overfit to the training data. Not to say that it's >>> necessarily possible to get a better fit. The default settings for trees >>> lean towards a tight fit, so you might modify their parameters to increase >>> regularisation. Still, you should not expect that evaluating a model's >>> performance on its training data will be indicative of its general >>> performance. This is why we use held-out test sets and cross-validation. >>> >>> On 27 December 2016 at 20:51, Roman Yurchak >>> wrote: >>> >>>> Hi Debu, >>>> >>>> On 27/12/16 08:18, Andrew Howe wrote: >>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>> 10-12 >>>> > % on probability thresholds above 0.5 >>>> >>>> Getting a high True Positive Rate (recall) is not a sufficient condition >>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>> could look at the precision at the same time (or consider, for instance, >>>> the F1 score). >>>> >>>> > 7. I reloaded the model in a different python instance from the >>>> > pickle file mentioned above and did my scoring , i.e., used >>>> > joblib library load method and then instantiated prediction >>>> > (predict_proba method) on the entire set of my original 600 K >>>> > records >>>> > Another question ? is there an alternate model scoring >>>> > library (apart from joblib, the one I am using) ? >>>> >>>> Joblib is not a scoring library; once you load a model from disk with >>>> joblib you should get ~ the same RandomForestClassifier estimator object >>>> as before saving it. >>>> >>>> > 8. Now when I am running (scoring) my model using >>>> > joblib.predict_proba on the entire set of original data (600 >>>> K), >>>> > I am getting a True Positive rate of around 80%. >>>> >>>> That sounds normal, considering what you are doing. Your entire set >>>> consists of 80% of training set (for which the recall, I imagine, would >>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>> average you would get a recall close to 0.8 for the complete set. Unless >>>> I missed something. >>>> >>>> >>>> > 9. I did some further analysis and figured out that during the >>>> > training process, when the model was predicting on the test >>>> > sample of 120K it could only predict 10-12% of 120K data >>>> beyond >>>> > a probability threshold of 0.5. When I am now trying to score >>>> my >>>> > model on the entire set of 600 K records, it appears that the >>>> > model is remembering some of it?s past behavior and data and >>>> > accordingly throwing 80% True positive rate >>>> >>>> It feels like your RandomForestClassifier is not properly tuned. A >>>> recall of 0.1 on the test set is quite low. It could be worth trying to >>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some >>>> other metric than the recall to evaluate the performance. 
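Since the discussion keeps returning to the 0.5 probability threshold, here is a small sketch of how such a cut-off relates to predict_proba and to metrics other than recall; clf, X_test and y_test are assumed to come from a held-out split as above:

from sklearn.metrics import precision_score, recall_score, f1_score

proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
y_pred = (proba >= 0.5).astype(int)       # the 0.5 threshold mentioned in the thread

print(precision_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))
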
>>>> >>>> >>>> Roman >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r --- > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed Dec 28 14:34:43 2016 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 28 Dec 2016 20:34:43 +0100 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: after the fit you need this call: for idx_tree, tree in enumerate(clf.estimators_): export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) On 28 December 2016 at 20:25, Debabrata Ghosh wrote: > Hi Guillaume, > With respect to the following point you > mentioned: > You can visualize the trees with sklearn.tree.export_graphviz: > http://scikit-learn.org/stable/modules/generated/sklearn.tre > e.export_graphviz.html > > I couldn't find a direct method for exporting the RandomForestClassifier > trees. Accordingly, I attempted for a workaround using the following code > but still no success: > > clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) > clf.fit(p_features_train,p_labels_train) > for i, tree in enumerate(clf.estimators_): > with open('tree_' + str(i) + '.dot', 'w') as dotfile: > tree.export_graphviz(clf, dotfile) > > Would you please be able to help me with the piece of code which I need to > execute for exporting the RandomForestClassifier trees. > > Cheers, > > Debu > > > On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < > g.lemaitre58 at gmail.com> wrote: > >> On 27 December 2016 at 18:17, Debabrata Ghosh >> wrote: >> >>> Dear Joel, Andrew and Roman, >>> Thank you very much >>> for your individual feedback ! It's very helpful indeed ! A few more points >>> related to my model execution: >>> >>> 1. By the term "scoring" I meant the process of executing the model once >>> again without retraining it. So , for training the model I used >>> RandomForestClassifer library and for my scoring (execution without >>> retraining) I have used joblib.dump and joblib.load >>> >> >> Go probably with the terms: training, validating, and testing. >> This is pretty much standard. Scoring is just the value of a >> metric given some data (training data, validation data, or >> testing data). >> >> >>> >>> 2. I have used the parameter n_estimator = 5000 while training my model. >>> Besides it , I have used n_jobs = -1 and haven't used any other parameter >>> >> >> You should probably check those other parameters and understand >> what are their effects. You should really check the link of Roman >> since GridSearchCV can help you to decide how to fix the parameters. >> http://scikit-learn.org/stable/modules/generated/sklearn. 
>> model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >> Additionally, 5000 trees seems a lot to me. >> >> >>> >>> 3. For my "scoring" activity (executing the model without retraining it) >>> is there an alternate approach to joblib library ? >>> >> >> Joblib only store data. There is not link with scoring (Check Roman >> answer) >> >> >>> >>> 4. When I execute my scoring job (joblib method) on a dataset , which is >>> completely different to my training dataset then I get similar True >>> Positive Rate and False Positive Rate as of training >>> >> >> It is what you should get. >> >> >>> >>> 5. However, when I execute my scoring job on the same dataset used for >>> training my model then I get very high TPR and FPR. >>> >> >> You are testing on some data which you used while training. Probably, >> one of the first rule is to not do that. If you want to evaluate in some >> way your classifier, have a separate set (test set) and only test on that >> one. As previously mentioned by Roman, 80% of your data are already >> known by the RandomForestClassifier and will be perfectly classified. >> >> >>> >>> Is there mechanism >>> through which I can visualise the trees created by my RandomForestClassifer >>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>> of .npy files created. Will those contain the trees ? >>> >> >> You can visualize the trees with sklearn.tree.export_graphviz: >> http://scikit-learn.org/stable/modules/generated/sklearn. >> tree.export_graphviz.html >> >> The bunch of npy are the data needed to load the RandomForestClassifier >> which >> you previously dumped. >> >> >>> >>> Thanks in advance ! >>> >>> Cheers, >>> >>> Debu >>> >>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>> wrote: >>> >>>> Your model is overfit to the training data. Not to say that it's >>>> necessarily possible to get a better fit. The default settings for trees >>>> lean towards a tight fit, so you might modify their parameters to increase >>>> regularisation. Still, you should not expect that evaluating a model's >>>> performance on its training data will be indicative of its general >>>> performance. This is why we use held-out test sets and cross-validation. >>>> >>>> On 27 December 2016 at 20:51, Roman Yurchak >>>> wrote: >>>> >>>>> Hi Debu, >>>>> >>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>>> 10-12 >>>>> > % on probability thresholds above 0.5 >>>>> >>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>> condition >>>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>>> could look at the precision at the same time (or consider, for >>>>> instance, >>>>> the F1 score). >>>>> >>>>> > 7. I reloaded the model in a different python instance from the >>>>> > pickle file mentioned above and did my scoring , i.e., used >>>>> > joblib library load method and then instantiated prediction >>>>> > (predict_proba method) on the entire set of my original 600 K >>>>> > records >>>>> > Another question ? is there an alternate model scoring >>>>> > library (apart from joblib, the one I am using) ? >>>>> >>>>> Joblib is not a scoring library; once you load a model from disk with >>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>> object >>>>> as before saving it. >>>>> >>>>> > 8. 
Now when I am running (scoring) my model using >>>>> > joblib.predict_proba on the entire set of original data (600 >>>>> K), >>>>> > I am getting a True Positive rate of around 80%. >>>>> >>>>> That sounds normal, considering what you are doing. Your entire set >>>>> consists of 80% of training set (for which the recall, I imagine, would >>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>> average you would get a recall close to 0.8 for the complete set. >>>>> Unless >>>>> I missed something. >>>>> >>>>> >>>>> > 9. I did some further analysis and figured out that during the >>>>> > training process, when the model was predicting on the test >>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>> beyond >>>>> > a probability threshold of 0.5. When I am now trying to >>>>> score my >>>>> > model on the entire set of 600 K records, it appears that the >>>>> > model is remembering some of it?s past behavior and data and >>>>> > accordingly throwing 80% True positive rate >>>>> >>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>> recall of 0.1 on the test set is quite low. It could be worth trying to >>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some >>>>> other metric than the recall to evaluate the performance. >>>>> >>>>> >>>>> Roman >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Wed Dec 28 23:38:21 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Thu, 29 Dec 2016 10:08:21 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Hi Guillaume, Thanks for your feedback ! I am still getting an error, while attempting to print the trees. Here is a snapshot of my code. I know I may be missing something very silly, but still wanted to check and see how this works. 
>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >>> clf.fit(p_features_train,p_labels_train) RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False) >>> for idx_tree, tree in enumerate(clf.estimators_): ... export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) ... Traceback (most recent call last): File "", line 2, in NameError: name 'export_graphviz' is not defined >>> for idx_tree, tree in enumerate(clf.estimators_): ... tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) ... Traceback (most recent call last): File "", line 2, in AttributeError: 'DecisionTreeClassifier' object has no attribute 'export_graphviz' Just to give you a background about the libraries, I have imported the following libraries: from sklearn.ensemble import RandomForestClassifier from sklearn import tree Thanks again as always ! Cheers, On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lema?tre wrote: > after the fit you need this call: > for idx_tree, tree in enumerate(clf.estimators_): > export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) > > > > On 28 December 2016 at 20:25, Debabrata Ghosh > wrote: > >> Hi Guillaume, >> With respect to the following point you >> mentioned: >> You can visualize the trees with sklearn.tree.export_graphviz: >> http://scikit-learn.org/stable/modules/generated/sklearn.tre >> e.export_graphviz.html >> >> I couldn't find a direct method for exporting the RandomForestClassifier >> trees. Accordingly, I attempted for a workaround using the following code >> but still no success: >> >> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >> clf.fit(p_features_train,p_labels_train) >> for i, tree in enumerate(clf.estimators_): >> with open('tree_' + str(i) + '.dot', 'w') as dotfile: >> tree.export_graphviz(clf, dotfile) >> >> Would you please be able to help me with the piece of code which I need >> to execute for exporting the RandomForestClassifier trees. >> >> Cheers, >> >> Debu >> >> >> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < >> g.lemaitre58 at gmail.com> wrote: >> >>> On 27 December 2016 at 18:17, Debabrata Ghosh >>> wrote: >>> >>>> Dear Joel, Andrew and Roman, >>>> Thank you very >>>> much for your individual feedback ! It's very helpful indeed ! A few more >>>> points related to my model execution: >>>> >>>> 1. By the term "scoring" I meant the process of executing the model >>>> once again without retraining it. So , for training the model I used >>>> RandomForestClassifer library and for my scoring (execution without >>>> retraining) I have used joblib.dump and joblib.load >>>> >>> >>> Go probably with the terms: training, validating, and testing. >>> This is pretty much standard. Scoring is just the value of a >>> metric given some data (training data, validation data, or >>> testing data). >>> >>> >>>> >>>> 2. I have used the parameter n_estimator = 5000 while training my >>>> model. Besides it , I have used n_jobs = -1 and haven't used any other >>>> parameter >>>> >>> >>> You should probably check those other parameters and understand >>> what are their effects. You should really check the link of Roman >>> since GridSearchCV can help you to decide how to fix the parameters. 
>>> http://scikit-learn.org/stable/modules/generated/sklearn.mod >>> el_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >>> Additionally, 5000 trees seems a lot to me. >>> >>> >>>> >>>> 3. For my "scoring" activity (executing the model without retraining >>>> it) is there an alternate approach to joblib library ? >>>> >>> >>> Joblib only store data. There is not link with scoring (Check Roman >>> answer) >>> >>> >>>> >>>> 4. When I execute my scoring job (joblib method) on a dataset , which >>>> is completely different to my training dataset then I get similar True >>>> Positive Rate and False Positive Rate as of training >>>> >>> >>> It is what you should get. >>> >>> >>>> >>>> 5. However, when I execute my scoring job on the same dataset used for >>>> training my model then I get very high TPR and FPR. >>>> >>> >>> You are testing on some data which you used while training. Probably, >>> one of the first rule is to not do that. If you want to evaluate in some >>> way your classifier, have a separate set (test set) and only test on that >>> one. As previously mentioned by Roman, 80% of your data are already >>> known by the RandomForestClassifier and will be perfectly classified. >>> >>> >>>> >>>> Is there mechanism >>>> through which I can visualise the trees created by my RandomForestClassifer >>>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>>> of .npy files created. Will those contain the trees ? >>>> >>> >>> You can visualize the trees with sklearn.tree.export_graphviz: >>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>> e.export_graphviz.html >>> >>> The bunch of npy are the data needed to load the RandomForestClassifier >>> which >>> you previously dumped. >>> >>> >>>> >>>> Thanks in advance ! >>>> >>>> Cheers, >>>> >>>> Debu >>>> >>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>>> wrote: >>>> >>>>> Your model is overfit to the training data. Not to say that it's >>>>> necessarily possible to get a better fit. The default settings for trees >>>>> lean towards a tight fit, so you might modify their parameters to increase >>>>> regularisation. Still, you should not expect that evaluating a model's >>>>> performance on its training data will be indicative of its general >>>>> performance. This is why we use held-out test sets and cross-validation. >>>>> >>>>> On 27 December 2016 at 20:51, Roman Yurchak >>>>> wrote: >>>>> >>>>>> Hi Debu, >>>>>> >>>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>>>> 10-12 >>>>>> > % on probability thresholds above 0.5 >>>>>> >>>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>>> condition >>>>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>>>> could look at the precision at the same time (or consider, for >>>>>> instance, >>>>>> the F1 score). >>>>>> >>>>>> > 7. I reloaded the model in a different python instance from the >>>>>> > pickle file mentioned above and did my scoring , i.e., used >>>>>> > joblib library load method and then instantiated prediction >>>>>> > (predict_proba method) on the entire set of my original 600 >>>>>> K >>>>>> > records >>>>>> > Another question ? is there an alternate model scoring >>>>>> > library (apart from joblib, the one I am using) ? >>>>>> >>>>>> Joblib is not a scoring library; once you load a model from disk with >>>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>>> object >>>>>> as before saving it. 
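For reference, the dump/load round trip that keeps coming up in this thread is only a few lines; a sketch assuming clf is the already-fitted forest and X_daily is a new day's feature matrix (in scikit-learn of this era joblib ships as sklearn.externals.joblib):

from sklearn.externals import joblib

joblib.dump(clf, 'model.pkl')            # also writes the companion .npy files seen earlier

clf_loaded = joblib.load('model.pkl')    # restores the fitted RandomForestClassifier
daily_proba = clf_loaded.predict_proba(X_daily)[:, 1]

No retraining happens at load time: predict_proba on clf_loaded behaves like predict_proba on the original clf, as Roman notes above.
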
>>>>>> >>>>>> > 8. Now when I am running (scoring) my model using >>>>>> > joblib.predict_proba on the entire set of original data >>>>>> (600 K), >>>>>> > I am getting a True Positive rate of around 80%. >>>>>> >>>>>> That sounds normal, considering what you are doing. Your entire set >>>>>> consists of 80% of training set (for which the recall, I imagine, >>>>>> would >>>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>>> average you would get a recall close to 0.8 for the complete set. >>>>>> Unless >>>>>> I missed something. >>>>>> >>>>>> >>>>>> > 9. I did some further analysis and figured out that during the >>>>>> > training process, when the model was predicting on the test >>>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>>> beyond >>>>>> > a probability threshold of 0.5. When I am now trying to >>>>>> score my >>>>>> > model on the entire set of 600 K records, it appears that >>>>>> the >>>>>> > model is remembering some of it?s past behavior and data and >>>>>> > accordingly throwing 80% True positive rate >>>>>> >>>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>>> recall of 0.1 on the test set is quite low. It could be worth trying >>>>>> to >>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using >>>>>> some >>>>>> other metric than the recall to evaluate the performance. >>>>>> >>>>>> >>>>>> Roman >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> INRIA Saclay - Ile-de-France >>> Equipe PARIETAL >>> guillaume.lemaitre at inria.f r --- >>> https://glemaitre.github.io/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r --- > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From naopon at gmail.com Wed Dec 28 23:50:36 2016 From: naopon at gmail.com (Naoya Kanai) Date: Wed, 28 Dec 2016 20:50:36 -0800 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: The ?tree? name is clashing between the sklearn.tree module and the DecisionTreeClassifier objects in the loop. You can change the import to from sklearn.tree import export_graphviz and modify the method call accordingly. ? On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh wrote: > Hi Guillaume, > Thanks for your feedback ! 
I am > still getting an error, while attempting to print the trees. Here is a > snapshot of my code. I know I may be missing something very silly, but > still wanted to check and see how this works. > > >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) > >>> clf.fit(p_features_train,p_labels_train) > RandomForestClassifier(bootstrap=True, class_weight=None, > criterion='gini', > max_depth=None, max_features='auto', max_leaf_nodes=None, > min_samples_leaf=1, min_samples_split=2, > min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1, > oob_score=False, random_state=None, verbose=0, > warm_start=False) > >>> for idx_tree, tree in enumerate(clf.estimators_): > ... export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) > ... > Traceback (most recent call last): > File "", line 2, in > NameError: name 'export_graphviz' is not defined > >>> for idx_tree, tree in enumerate(clf.estimators_): > ... tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) > ... > Traceback (most recent call last): > File "", line 2, in > AttributeError: 'DecisionTreeClassifier' object has no attribute > 'export_graphviz' > > Just to give you a background about the libraries, I have imported the > following libraries: > > from sklearn.ensemble import RandomForestClassifier > from sklearn import tree > > Thanks again as always ! > > Cheers, > > On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lema?tre < > g.lemaitre58 at gmail.com> wrote: > >> after the fit you need this call: >> for idx_tree, tree in enumerate(clf.estimators_): >> export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) >> >> >> >> On 28 December 2016 at 20:25, Debabrata Ghosh >> wrote: >> >>> Hi Guillaume, >>> With respect to the following point you >>> mentioned: >>> You can visualize the trees with sklearn.tree.export_graphviz: >>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>> e.export_graphviz.html >>> >>> I couldn't find a direct method for exporting the RandomForestClassifier >>> trees. Accordingly, I attempted for a workaround using the following code >>> but still no success: >>> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >>> clf.fit(p_features_train,p_labels_train) >>> for i, tree in enumerate(clf.estimators_): >>> with open('tree_' + str(i) + '.dot', 'w') as dotfile: >>> tree.export_graphviz(clf, dotfile) >>> >>> Would you please be able to help me with the piece of code which I need >>> to execute for exporting the RandomForestClassifier trees. >>> >>> Cheers, >>> >>> Debu >>> >>> >>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < >>> g.lemaitre58 at gmail.com> wrote: >>> >>>> On 27 December 2016 at 18:17, Debabrata Ghosh >>>> wrote: >>>> >>>>> Dear Joel, Andrew and Roman, >>>>> Thank you very >>>>> much for your individual feedback ! It's very helpful indeed ! A few more >>>>> points related to my model execution: >>>>> >>>>> 1. By the term "scoring" I meant the process of executing the model >>>>> once again without retraining it. So , for training the model I used >>>>> RandomForestClassifer library and for my scoring (execution without >>>>> retraining) I have used joblib.dump and joblib.load >>>>> >>>> >>>> Go probably with the terms: training, validating, and testing. >>>> This is pretty much standard. Scoring is just the value of a >>>> metric given some data (training data, validation data, or >>>> testing data). >>>> >>>> >>>>> >>>>> 2. I have used the parameter n_estimator = 5000 while training my >>>>> model. 
Besides it , I have used n_jobs = -1 and haven't used any other >>>>> parameter >>>>> >>>> >>>> You should probably check those other parameters and understand >>>> what are their effects. You should really check the link of Roman >>>> since GridSearchCV can help you to decide how to fix the parameters. >>>> http://scikit-learn.org/stable/modules/generated/sklearn.mod >>>> el_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >>>> Additionally, 5000 trees seems a lot to me. >>>> >>>> >>>>> >>>>> 3. For my "scoring" activity (executing the model without retraining >>>>> it) is there an alternate approach to joblib library ? >>>>> >>>> >>>> Joblib only store data. There is not link with scoring (Check Roman >>>> answer) >>>> >>>> >>>>> >>>>> 4. When I execute my scoring job (joblib method) on a dataset , which >>>>> is completely different to my training dataset then I get similar True >>>>> Positive Rate and False Positive Rate as of training >>>>> >>>> >>>> It is what you should get. >>>> >>>> >>>>> >>>>> 5. However, when I execute my scoring job on the same dataset used for >>>>> training my model then I get very high TPR and FPR. >>>>> >>>> >>>> You are testing on some data which you used while training. Probably, >>>> one of the first rule is to not do that. If you want to evaluate in some >>>> way your classifier, have a separate set (test set) and only test on >>>> that >>>> one. As previously mentioned by Roman, 80% of your data are already >>>> known by the RandomForestClassifier and will be perfectly classified. >>>> >>>> >>>>> >>>>> Is there mechanism >>>>> through which I can visualise the trees created by my RandomForestClassifer >>>>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>>>> of .npy files created. Will those contain the trees ? >>>>> >>>> >>>> You can visualize the trees with sklearn.tree.export_graphviz: >>>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>>> e.export_graphviz.html >>>> >>>> The bunch of npy are the data needed to load the RandomForestClassifier >>>> which >>>> you previously dumped. >>>> >>>> >>>>> >>>>> Thanks in advance ! >>>>> >>>>> Cheers, >>>>> >>>>> Debu >>>>> >>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>>>> wrote: >>>>> >>>>>> Your model is overfit to the training data. Not to say that it's >>>>>> necessarily possible to get a better fit. The default settings for trees >>>>>> lean towards a tight fit, so you might modify their parameters to increase >>>>>> regularisation. Still, you should not expect that evaluating a model's >>>>>> performance on its training data will be indicative of its general >>>>>> performance. This is why we use held-out test sets and cross-validation. >>>>>> >>>>>> On 27 December 2016 at 20:51, Roman Yurchak >>>>>> wrote: >>>>>> >>>>>>> Hi Debu, >>>>>>> >>>>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>>>>> 10-12 >>>>>>> > % on probability thresholds above 0.5 >>>>>>> >>>>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>>>> condition >>>>>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>>>>> could look at the precision at the same time (or consider, for >>>>>>> instance, >>>>>>> the F1 score). >>>>>>> >>>>>>> > 7. 
I reloaded the model in a different python instance from >>>>>>> the >>>>>>> > pickle file mentioned above and did my scoring , i.e., used >>>>>>> > joblib library load method and then instantiated prediction >>>>>>> > (predict_proba method) on the entire set of my original >>>>>>> 600 K >>>>>>> > records >>>>>>> > Another question ? is there an alternate model >>>>>>> scoring >>>>>>> > library (apart from joblib, the one I am using) ? >>>>>>> >>>>>>> Joblib is not a scoring library; once you load a model from disk with >>>>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>>>> object >>>>>>> as before saving it. >>>>>>> >>>>>>> > 8. Now when I am running (scoring) my model using >>>>>>> > joblib.predict_proba on the entire set of original data >>>>>>> (600 K), >>>>>>> > I am getting a True Positive rate of around 80%. >>>>>>> >>>>>>> That sounds normal, considering what you are doing. Your entire set >>>>>>> consists of 80% of training set (for which the recall, I imagine, >>>>>>> would >>>>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>>>> average you would get a recall close to 0.8 for the complete set. >>>>>>> Unless >>>>>>> I missed something. >>>>>>> >>>>>>> >>>>>>> > 9. I did some further analysis and figured out that during >>>>>>> the >>>>>>> > training process, when the model was predicting on the test >>>>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>>>> beyond >>>>>>> > a probability threshold of 0.5. When I am now trying to >>>>>>> score my >>>>>>> > model on the entire set of 600 K records, it appears that >>>>>>> the >>>>>>> > model is remembering some of it?s past behavior and data >>>>>>> and >>>>>>> > accordingly throwing 80% True positive rate >>>>>>> >>>>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>>>> recall of 0.1 on the test set is quite low. It could be worth trying >>>>>>> to >>>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using >>>>>>> some >>>>>>> other metric than the recall to evaluate the performance. 
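As a concrete illustration of that tuning suggestion, a rough sketch with GridSearchCV and an F1 scorer might look like the following. The parameter grid and the imbalanced toy data are illustrative assumptions, not values taken from the thread.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Imbalanced toy data, standing in for the poster's records.
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Illustrative grid; the useful ranges depend on the data.
    param_grid = {
        'n_estimators': [100, 300],
        'max_depth': [None, 10, 20],
        'min_samples_leaf': [1, 5, 10],
    }

    # Optimise F1 rather than recall alone, as suggested above.
    search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid,
                          scoring='f1', cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)
    print(search.score(X_test, y_test))  # F1 of the best model on held-out data
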
>>>>>>> >>>>>>> >>>>>>> Roman >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> INRIA Saclay - Ile-de-France >>>> Equipe PARIETAL >>>> guillaume.lemaitre at inria.f r --- >>>> https://glemaitre.github.io/ >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Thu Dec 29 00:00:56 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Thu, 29 Dec 2016 10:30:56 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Thanks Naoya ! This has worked and I am able to generate the .dot files. Cheers, Debu On Thu, Dec 29, 2016 at 10:20 AM, Naoya Kanai wrote: > The ?tree? name is clashing between the sklearn.tree module and the > DecisionTreeClassifier objects in the loop. > > You can change the import to > > from sklearn.tree import export_graphviz > > and modify the method call accordingly. > ? > > On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh > wrote: > >> Hi Guillaume, >> Thanks for your feedback ! I am >> still getting an error, while attempting to print the trees. Here is a >> snapshot of my code. I know I may be missing something very silly, but >> still wanted to check and see how this works. >> >> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >> >>> clf.fit(p_features_train,p_labels_train) >> RandomForestClassifier(bootstrap=True, class_weight=None, >> criterion='gini', >> max_depth=None, max_features='auto', max_leaf_nodes=None, >> min_samples_leaf=1, min_samples_split=2, >> min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1, >> oob_score=False, random_state=None, verbose=0, >> warm_start=False) >> >>> for idx_tree, tree in enumerate(clf.estimators_): >> ... export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) >> ... >> Traceback (most recent call last): >> File "", line 2, in >> NameError: name 'export_graphviz' is not defined >> >>> for idx_tree, tree in enumerate(clf.estimators_): >> ... 
tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) >> ... >> Traceback (most recent call last): >> File "", line 2, in >> AttributeError: 'DecisionTreeClassifier' object has no attribute >> 'export_graphviz' >> >> Just to give you a background about the libraries, I have imported the >> following libraries: >> >> from sklearn.ensemble import RandomForestClassifier >> from sklearn import tree >> >> Thanks again as always ! >> >> Cheers, >> >> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lema?tre < >> g.lemaitre58 at gmail.com> wrote: >> >>> after the fit you need this call: >>> for idx_tree, tree in enumerate(clf.estimators_): >>> export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) >>> >>> >>> >>> On 28 December 2016 at 20:25, Debabrata Ghosh >>> wrote: >>> >>>> Hi Guillaume, >>>> With respect to the following point you >>>> mentioned: >>>> You can visualize the trees with sklearn.tree.export_graphviz: >>>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>>> e.export_graphviz.html >>>> >>>> I couldn't find a direct method for exporting the >>>> RandomForestClassifier trees. Accordingly, I attempted for a workaround >>>> using the following code but still no success: >>>> >>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >>>> clf.fit(p_features_train,p_labels_train) >>>> for i, tree in enumerate(clf.estimators_): >>>> with open('tree_' + str(i) + '.dot', 'w') as dotfile: >>>> tree.export_graphviz(clf, dotfile) >>>> >>>> Would you please be able to help me with the piece of code which I need >>>> to execute for exporting the RandomForestClassifier trees. >>>> >>>> Cheers, >>>> >>>> Debu >>>> >>>> >>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < >>>> g.lemaitre58 at gmail.com> wrote: >>>> >>>>> On 27 December 2016 at 18:17, Debabrata Ghosh >>>>> wrote: >>>>> >>>>>> Dear Joel, Andrew and Roman, >>>>>> Thank you very >>>>>> much for your individual feedback ! It's very helpful indeed ! A few more >>>>>> points related to my model execution: >>>>>> >>>>>> 1. By the term "scoring" I meant the process of executing the model >>>>>> once again without retraining it. So , for training the model I used >>>>>> RandomForestClassifer library and for my scoring (execution without >>>>>> retraining) I have used joblib.dump and joblib.load >>>>>> >>>>> >>>>> Go probably with the terms: training, validating, and testing. >>>>> This is pretty much standard. Scoring is just the value of a >>>>> metric given some data (training data, validation data, or >>>>> testing data). >>>>> >>>>> >>>>>> >>>>>> 2. I have used the parameter n_estimator = 5000 while training my >>>>>> model. Besides it , I have used n_jobs = -1 and haven't used any other >>>>>> parameter >>>>>> >>>>> >>>>> You should probably check those other parameters and understand >>>>> what are their effects. You should really check the link of Roman >>>>> since GridSearchCV can help you to decide how to fix the parameters. >>>>> http://scikit-learn.org/stable/modules/generated/sklearn.mod >>>>> el_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >>>>> Additionally, 5000 trees seems a lot to me. >>>>> >>>>> >>>>>> >>>>>> 3. For my "scoring" activity (executing the model without retraining >>>>>> it) is there an alternate approach to joblib library ? >>>>>> >>>>> >>>>> Joblib only store data. There is not link with scoring (Check Roman >>>>> answer) >>>>> >>>>> >>>>>> >>>>>> 4. 
When I execute my scoring job (joblib method) on a dataset , which >>>>>> is completely different to my training dataset then I get similar True >>>>>> Positive Rate and False Positive Rate as of training >>>>>> >>>>> >>>>> It is what you should get. >>>>> >>>>> >>>>>> >>>>>> 5. However, when I execute my scoring job on the same dataset used >>>>>> for training my model then I get very high TPR and FPR. >>>>>> >>>>> >>>>> You are testing on some data which you used while training. Probably, >>>>> one of the first rule is to not do that. If you want to evaluate in >>>>> some >>>>> way your classifier, have a separate set (test set) and only test on >>>>> that >>>>> one. As previously mentioned by Roman, 80% of your data are already >>>>> known by the RandomForestClassifier and will be perfectly classified. >>>>> >>>>> >>>>>> >>>>>> Is there mechanism >>>>>> through which I can visualise the trees created by my RandomForestClassifer >>>>>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>>>>> of .npy files created. Will those contain the trees ? >>>>>> >>>>> >>>>> You can visualize the trees with sklearn.tree.export_graphviz: >>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>>>> e.export_graphviz.html >>>>> >>>>> The bunch of npy are the data needed to load the >>>>> RandomForestClassifier which >>>>> you previously dumped. >>>>> >>>>> >>>>>> >>>>>> Thanks in advance ! >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Debu >>>>>> >>>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>>>> > wrote: >>>>>> >>>>>>> Your model is overfit to the training data. Not to say that it's >>>>>>> necessarily possible to get a better fit. The default settings for trees >>>>>>> lean towards a tight fit, so you might modify their parameters to increase >>>>>>> regularisation. Still, you should not expect that evaluating a model's >>>>>>> performance on its training data will be indicative of its general >>>>>>> performance. This is why we use held-out test sets and cross-validation. >>>>>>> >>>>>>> On 27 December 2016 at 20:51, Roman Yurchak >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Debu, >>>>>>>> >>>>>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>>>>> > 5. I got a prediction result with True Positive Rate (TPR) >>>>>>>> as 10-12 >>>>>>>> > % on probability thresholds above 0.5 >>>>>>>> >>>>>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>>>>> condition >>>>>>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>>>>>> could look at the precision at the same time (or consider, for >>>>>>>> instance, >>>>>>>> the F1 score). >>>>>>>> >>>>>>>> > 7. I reloaded the model in a different python instance from >>>>>>>> the >>>>>>>> > pickle file mentioned above and did my scoring , i.e., >>>>>>>> used >>>>>>>> > joblib library load method and then instantiated >>>>>>>> prediction >>>>>>>> > (predict_proba method) on the entire set of my original >>>>>>>> 600 K >>>>>>>> > records >>>>>>>> > Another question ? is there an alternate model >>>>>>>> scoring >>>>>>>> > library (apart from joblib, the one I am using) ? >>>>>>>> >>>>>>>> Joblib is not a scoring library; once you load a model from disk >>>>>>>> with >>>>>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>>>>> object >>>>>>>> as before saving it. >>>>>>>> >>>>>>>> > 8. Now when I am running (scoring) my model using >>>>>>>> > joblib.predict_proba on the entire set of original data >>>>>>>> (600 K), >>>>>>>> > I am getting a True Positive rate of around 80%. 
>>>>>>>> >>>>>>>> That sounds normal, considering what you are doing. Your entire set >>>>>>>> consists of 80% of training set (for which the recall, I imagine, >>>>>>>> would >>>>>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>>>>> average you would get a recall close to 0.8 for the complete set. >>>>>>>> Unless >>>>>>>> I missed something. >>>>>>>> >>>>>>>> >>>>>>>> > 9. I did some further analysis and figured out that during >>>>>>>> the >>>>>>>> > training process, when the model was predicting on the >>>>>>>> test >>>>>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>>>>> beyond >>>>>>>> > a probability threshold of 0.5. When I am now trying to >>>>>>>> score my >>>>>>>> > model on the entire set of 600 K records, it appears that >>>>>>>> the >>>>>>>> > model is remembering some of it?s past behavior and data >>>>>>>> and >>>>>>>> > accordingly throwing 80% True positive rate >>>>>>>> >>>>>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>>>>> recall of 0.1 on the test set is quite low. It could be worth >>>>>>>> trying to >>>>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using >>>>>>>> some >>>>>>>> other metric than the recall to evaluate the performance. >>>>>>>> >>>>>>>> >>>>>>>> Roman >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Guillaume Lemaitre >>>>> INRIA Saclay - Ile-de-France >>>>> Equipe PARIETAL >>>>> guillaume.lemaitre at inria.f r --- >>>>> https://glemaitre.github.io/ >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> INRIA Saclay - Ile-de-France >>> Equipe PARIETAL >>> guillaume.lemaitre at inria.f r --- >>> https://glemaitre.github.io/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
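To summarise the workflow the replies converge on (fit on a training split, keep a separate test split, use joblib only to persist the fitted model, and evaluate with more than recall), here is a minimal sketch on toy data; the 600K records are not available here, so the printed figures will not match the poster's. Note also that Roman's rough calculation holds: scoring a mixture of 80% training rows (recall close to 1.0) and 20% test rows (recall about 0.1) gives about 0.8 * 1.0 + 0.2 * 0.1 = 0.82.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.externals import joblib  # bundled with scikit-learn at the time of this thread
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)

    # joblib only stores and reloads the estimator; it does not score anything.
    joblib.dump(clf, 'rf_model.pkl')
    clf_loaded = joblib.load('rf_model.pkl')

    # Evaluate on the held-out test set only, never on rows seen during training.
    proba = clf_loaded.predict_proba(X_test)[:, 1]
    y_pred = (proba >= 0.5).astype(int)
    print('recall   :', recall_score(y_test, y_pred))
    print('precision:', precision_score(y_test, y_pred))
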
URL: From greg315 at hotmail.fr Thu Dec 29 09:00:39 2016 From: greg315 at hotmail.fr (greg g) Date: Thu, 29 Dec 2016 14:00:39 +0000 Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays Message-ID: Hi, I would like to understand the behaviour of the scipy.spatial.kdtree class that uses numpy.amin function. In the numpy.amin description, we find that it returns the "minimum value along a given axis" What does it mean exactly ? Thanks for any help Gregory -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Thu Dec 29 14:22:45 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 29 Dec 2016 11:22:45 -0800 Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays In-Reply-To: References: Message-ID: It means that instead of returning the minimum value anywhere in the entire matrix, it will return the minimum value for each column or each row depending on which axis you put in, so a vector instead of a scalar. On Thu, Dec 29, 2016 at 6:00 AM, greg g wrote: > Hi, > > I would like to understand the behaviour of the scipy.spatial.kdtree > class that uses numpy.amin function. > > In the numpy.amin description, we find that it returns the "minimum value > along a given axis" > > What does it mean exactly ? > > > Thanks for any help > > Gregory > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Dec 29 15:03:41 2016 From: t3kcit at gmail.com (Andy) Date: Thu, 29 Dec 2016 15:03:41 -0500 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: <20161221223338.GA334300@phare.normalesup.org> Message-ID: <33ad4356-42f7-264a-1e94-1548649a9bc2@gmail.com> On 12/21/2016 07:48 PM, Joel Nothman wrote: > Well, you can as a browser extension. I just haven't bothered to > investigate that technology when there's so much code to review and write. > Can you post it to the docs or maybe more appropriately to the wiki where it's easier to discover and link to? From greg315 at hotmail.fr Thu Dec 29 17:06:50 2016 From: greg315 at hotmail.fr (greg g) Date: Thu, 29 Dec 2016 22:06:50 +0000 Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays In-Reply-To: References: , Message-ID: Thanks Is this a numpy specific terminology ? For a multidimensionnal array with dimension=n and size l1 x l2 x ... x ln, does "along axis=0" mean that l2 x..x ln operations are performed scrolling first dimension, each operation on l1 elements, and that an array with dimension n-1 and size l2 x..x ln containing the operations results is returned? ( Finally I'm not sure this sentence really clarify ... ;-) ) ________________________________ De : scikit-learn de la part de Jacob Schreiber Envoy? : jeudi 29 d?cembre 2016 20:22 ? : Scikit-learn user and developer mailing list Objet : Re: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays It means that instead of returning the minimum value anywhere in the entire matrix, it will return the minimum value for each column or each row depending on which axis you put in, so a vector instead of a scalar. On Thu, Dec 29, 2016 at 6:00 AM, greg g > wrote: Hi, I would like to understand the behaviour of the scipy.spatial.kdtree class that uses numpy.amin function. 
In the numpy.amin description, we find that it returns the "minimum value
along a given axis"
What does it mean exactly ?

Thanks for any help
Gregory

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jni.soma at gmail.com  Thu Dec 29 17:32:45 2016
From: jni.soma at gmail.com (Juan Nunez-Iglesias)
Date: Fri, 30 Dec 2016 09:32:45 +1100
Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays
In-Reply-To: References: 
Message-ID: <8072f094-fff9-48a2-8184-10173ba4d51c@Spark>

Hi Greg,
I don't know how specific it is to NumPy, but that's definitely the correct
way to talk about it in NumPy, and your understanding in your example is
spot-on. This is true of many NumPy functions.
Juan.

On 30 Dec. 2016, 9:08 AM +1100, greg g , wrote:
> Thanks
> Is this a numpy specific terminology ?
> For a multidimensional array with dimension=n and size l1 x l2 x ... x ln,
> does "along axis=0" mean that l2 x..x ln operations are performed scrolling
> the first dimension, each operation on l1 elements, and that an array with
> dimension n-1 and size l2 x..x ln containing the operations' results is
> returned?
> (Finally, I'm not sure this sentence really clarifies things... ;-) )
>
> From : scikit-learn on behalf of Jacob Schreiber
> Sent : Thursday 29 December 2016 20:22
> To : Scikit-learn user and developer mailing list
> Subject : Re: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays
>
> It means that instead of returning the minimum value anywhere in the entire
> matrix, it will return the minimum value for each column or each row
> depending on which axis you put in, so a vector instead of a scalar.
>
> > On Thu, Dec 29, 2016 at 6:00 AM, greg g wrote:
> > > Hi,
> > > I would like to understand the behaviour of the scipy.spatial.kdtree
> > > class that uses numpy.amin function.
> > > In the numpy.amin description, we find that it returns the "minimum value
> > > along a given axis"
> > > What does it mean exactly ?
> > >
> > > Thanks for any help
> > > Gregory
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From greg315 at hotmail.fr  Fri Dec 30 04:56:43 2016
From: greg315 at hotmail.fr (greg g)
Date: Fri, 30 Dec 2016 09:56:43 +0000
Subject: [scikit-learn] Your machine learning practical applications ?
Message-ID: 

Hi,
Beginning with sklearn, I'm interested in practical applications of this
library and in applications of machine learning in general.
I would be grateful to hear 'real stories' about ML and eventually share
them in an upcoming non-technical blog.
Thanks if you can share some cases...
Gregory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
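Returning to the numpy.amin question above, a small self-contained example of the axis behaviour Jacob and Juan describe (the numbers are arbitrary):

    import numpy as np

    a = np.array([[3, 7, 2],
                  [5, 1, 9]])        # shape (2, 3)

    print(np.amin(a))                # 1, the minimum over the whole array
    print(np.amin(a, axis=0))        # [3 1 2], one minimum per column, shape (3,)
    print(np.amin(a, axis=1))        # [2 1], one minimum per row, shape (2,)

    # More generally, for an array of shape (l1, l2, ..., ln), reducing along
    # axis=0 collapses the first dimension and leaves shape (l2, ..., ln).
    b = np.zeros((4, 5, 6))
    print(np.amin(b, axis=0).shape)  # (5, 6)

This matches Gregory's formulation: the reduction is applied over the first dimension, once for each of the l2 x ... x ln remaining positions.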