From andrewholmes82 at icloud.com Wed Jun 1 07:02:48 2016
From: andrewholmes82 at icloud.com (Andrew Holmes)
Date: Wed, 01 Jun 2016 12:02:48 +0100
Subject: [scikit-learn] Artificial neural network not learning lower values of the training sample

A previous commenter asked about the published research you mentioned in which this was working OK. If you're using the same data as them, could you try to replicate their results first?

Best wishes
Andrew

@andrewholmes82

> On 31 May 2016, at 20:05, muhammad waseem wrote:
>
> I try to balance it out; the dataset is very periodic (similar behaviour each year).
>
> On Tue, May 31, 2016 at 8:01 PM, Andrew Holmes wrote:
>> Is the training set unbalanced between high and low values? I.e., many more of the high ones?
>>
>>> On 31 May 2016, at 20:00, muhammad waseem wrote:
>>>
>>> Yes, it has poor performance (higher errors) on lower values.
>>> I have tried random forest but, as I mentioned, it did not give good results either; I can try SVR.
>>>
>>> On Tue, May 31, 2016 at 6:54 PM, Andrew Holmes wrote:
>>>> When you say it's not learning "lower values", does that mean the model has good predictions on high values in the test set, but poor performance on the low ones?
>>>>
>>>> Have you tried simpler models like a tree, a random forest and an SVM as a benchmark?
>>>>
>>>>> On 31 May 2016, at 16:59, Andrew Holmes wrote:
>>>>>
>>>>> If the problem is that it's confusing day and night, are you including time of day as a parameter?
>>>>>
>>>>>> On 31 May 2016, at 16:55, muhammad waseem wrote:
>>>>>>
>>>>>> Hi All,
>>>>>> I am trying to train an ANN, but so far it is not learning the lower values of the training sample. I have tried using different Python libraries to train the ANN. The aim is to predict solar radiation from other weather parameters (a regression problem). I think the ANN is (probably) confusing lower values (winter/cloudy days) with the night-time values. I have tried the following, but none of them worked:
>>>>>>
>>>>>> 1. Scaling data between different values, e.g. [0,1], [-1,1]
>>>>>> 2. Standardising data to have zero mean and unit variance
>>>>>> 3. Shuffling the data
>>>>>> 4. Increasing the training samples (from 3 years to 10 years)
>>>>>> 5. Using different train functions
>>>>>> 6. Trying different transfer functions
>>>>>> 7. Using fewer input variables
>>>>>> 8. Varying hidden layers and hidden layers' neurons
>>>>>>
>>>>>> Any idea what could be wrong, or any directions to try?
>>>>>>
>>>>>> Thanks
>>>>>> Kindest Regards
>>>>>> Waseem
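[A minimal sketch of points 1-2 above plus the suggested forest benchmark, assuming scikit-learn >= 0.18 (the first release with MLPRegressor; at the time of this thread it was still in development) and toy stand-in data in place of the poster's actual weather set:]

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    # Toy stand-ins for the poster's data: six weather columns and a
    # non-negative "radiation" target that is zero-heavy, like night hours.
    rng = np.random.RandomState(0)
    X = rng.rand(500, 6)
    y = np.maximum(0.0, np.sin(np.pi * X[:, 2]) - 0.3 + 0.1 * rng.randn(500))

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Scale inside a pipeline so the test set is transformed with the
    # training set's statistics (avoids leakage).
    ann = make_pipeline(StandardScaler(),
                        MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000,
                                     random_state=0))
    ann.fit(X_train, y_train)

    # Simple benchmark, as suggested above.
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)

    # Report error on the whole test set and on its lowest quartile,
    # to check whether the low values really are the weak spot.
    low = y_test < np.percentile(y_test, 25)
    for name, model in (("MLP", ann), ("RF ", rf)):
        pred = model.predict(X_test)
        print(name, mean_absolute_error(y_test, pred),
              mean_absolute_error(y_test[low], pred[low]))

[If the low end still lags, weighting or oversampling the low-radiation rows, or modelling day and night separately, are common next steps.]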
From m.waseem.ahmad at gmail.com Wed Jun 1 08:05:39 2016
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Wed, 1 Jun 2016 13:05:39 +0100
Subject: [scikit-learn] Artificial neural network not learning lower values of the training sample

It's not the same data (different locations), but I have tried to use the same input and output variables.

Thanks
Waseem

> On Wed, Jun 1, 2016 at 12:02 PM, Andrew Holmes wrote:
>
> A previous commenter asked about the published research you mentioned in which this was working OK. If you're using the same data as them, could you try to replicate their results first?
From cbrew at acm.org Wed Jun 1 08:40:53 2016
From: cbrew at acm.org (chris brew)
Date: Wed, 1 Jun 2016 13:40:53 +0100
Subject: [scikit-learn] Artificial neural network not learning lower values of the training sample

Experienced machine learning people usually start by trying to exactly replicate what the paper did, using exactly the same data, exactly the same methods and, if possible, even exactly the same software.

It is very comforting if you can do this, because you can then go ahead and make changes, secure in the knowledge that any changes in output are due to the changes you made, and not to small mistakes in data preparation, tool arguments or something. So it is a good plan to start by seeing whether you can arrange to have the same setup as the writers of the original paper.

It is polite to try to do this on your own if you can (and the effort is a useful learning experience, usually). Maybe the data and/or software is publicly available. If not, authors are often willing and able to share material privately, so you could write to them and ask for help. Since you are showing interest in their work, they will want to help if they can. And if they do, you have the beginnings of a useful personal connection.

The only precondition is that you should (again, as a matter of politeness) first make an honest attempt to replicate their work and make sure you really understand it. After you have done that, asking for help is appropriate.
Generally speaking, the time to ask questions on a public forum like scikit-learn is also after you have done all you can to solve your problem.

Best wishes
Chris

> On 1 June 2016 at 13:05, muhammad waseem wrote:
>
> It's not the same data (different locations), but I have tried to use the same input and output variables.
From ahowe42 at gmail.com Wed Jun 1 09:01:41 2016
From: ahowe42 at gmail.com (Andrew Howe)
Date: Wed, 1 Jun 2016 16:01:41 +0300
Subject: [scikit-learn] Artificial neural network not learning lower values of the training sample

What about adding a simple binary night/day flag? While it's less information than the hour, it will provide a distinct cutoff for the network to use.

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
Editor-in-Chief, European Journal of Mathematical Sciences
Executive Editor, European Journal of Pure and Applied Mathematics
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

> On Tue, May 31, 2016 at 8:47 PM, muhammad waseem wrote:
>
> Thanks for your reply. I have day, month, hour, temperature, relative humidity and wind speed as my input variables. I can't think of any other dependent variables. It is quite strange to me that I don't get results after using these input variables.
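[A sketch of the flag idea, together with a cyclical hour encoding that is often paired with it; the DataFrame and the fixed 6-20 daylight window are illustrative placeholders, not the poster's data:]

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"hour": np.tile(np.arange(24), 3)})  # toy stand-in

    # Binary night/day flag. The 6-20 daylight window is a crude placeholder;
    # in practice it should come from sunrise/sunset times for the site.
    df["is_day"] = df["hour"].between(6, 20).astype(int)

    # Cyclical encoding so hour 23 and hour 0 end up close together,
    # instead of 23 units apart on a linear scale.
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

[The same sin/cos trick applies to the month column for the seasonal cycle.]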
From m.waseem.ahmad at gmail.com Wed Jun 1 09:38:14 2016
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Wed, 1 Jun 2016 14:38:14 +0100
Subject: [scikit-learn] Artificial neural network not learning lower values of the training sample

@Chris: Thanks for your reply, I will try to contact them.
@Andrew Howe: I think I will try to add the number of sunshine hours as another variable, which will be the same value for the whole day.

Thanks
Kindest Regards
Waseem

> On Wed, Jun 1, 2016 at 2:01 PM, Andrew Howe wrote:
>
> What about adding a simple binary night/day flag? While it's less information than the hour, it will provide a distinct cutoff for the network to use.
From ruchika.work at gmail.com Wed Jun 1 13:24:35 2016
From: ruchika.work at gmail.com (Ruchika Nayyar)
Date: Wed, 1 Jun 2016 10:24:35 -0700
Subject: [scikit-learn] Fwd: ValueError

Hi

I am new to scikit-learn, and I see an error while writing a Python script to do a simple BDT with it. It happens when I do this:

from sklearn import datasets

Traceback (most recent call last):
  File "bdt.py", line 12, in <module>
    from sklearn import datasets
  File "/Library/Python/2.7/site-packages/sklearn/__init__.py", line 57, in <module>
    from .base import clone
  File "/Library/Python/2.7/site-packages/sklearn/base.py", line 11, in <module>
    from .utils.fixes import signature
  File "/Library/Python/2.7/site-packages/sklearn/utils/__init__.py", line 10, in <module>
    from .murmurhash import murmurhash3_32
  File "numpy.pxd", line 155, in init sklearn.utils.murmurhash (sklearn/utils/murmurhash.c:5029)
ValueError: numpy.dtype has the wrong size, try recompiling

I have already tried to uninstall numpy/scipy and pandas. They are all the latest and compatible versions, but something is not right. Can you tell me what I am doing wrong?

Thanks,
Ruchika
----------------------------------------
Dr Ruchika Nayyar,
Post Doctoral Fellow for ATLAS Collaboration
University of Arizona
Arizona, USA.
--------------------------------------------
From mail at sebastianraschka.com Wed Jun 1 13:46:36 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 1 Jun 2016 13:46:36 -0400
Subject: [scikit-learn] Fwd: ValueError

Hi Ruchika,

could you maybe post the results from

$ python -c 'import numpy; print(numpy.__version__)'
1.11.0
$ python -c 'import numpy; print(scipy.__version__)'
0.17.0

just to make sure that these are indeed the latest versions? However, I suspect that this is more of a compile rather than a version issue, since scikit-learn should work fine on older versions of NumPy and SciPy; e.g., one of the CI tests runs with NUMPY_VERSION="1.6.2" and SCIPY_VERSION="0.11.0".

Does NumPy run correctly if you run some examples without scikit-learn? E.g., you may want to run

import numpy
numpy.test('full')

import scipy
scipy.test('full')

to narrow down the problem further.

And how did you compile & install scikit-learn?

Best,
Sebastian

From maniteja.modesty067 at gmail.com Wed Jun 1 13:52:46 2016
From: maniteja.modesty067 at gmail.com (Maniteja Nandana)
Date: Wed, 1 Jun 2016 23:22:46 +0530
Subject: [scikit-learn] Fwd: ValueError

Hi,

I hope I am understanding the question correctly here. This happens when there is a mismatch between the numpy installed and the one scikit-learn was compiled against. In case there are multiple versions of numpy installed, it is preferable to put the newest one on PYTHONPATH. If a stable version is needed, uninstall the previously installed scikit-learn, use ``pip install --upgrade numpy`` and then reinstall scikit-learn.

Hope it is helpful.

Regards,
Maniteja.
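[Spelled out as shell commands, that suggestion amounts to something like the following sketch; it assumes `pip` and `python` point at the same interpreter, which is worth verifying first:]

    # Make sure pip and python refer to the same interpreter.
    python -c 'import sys; print(sys.executable)'
    pip --version

    # Remove the old build, upgrade numpy/scipy, then reinstall scikit-learn
    # so it is built against the upgraded numpy.
    pip uninstall -y scikit-learn
    pip install --upgrade numpy scipy
    pip install scikit-learn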
From nfliu at uw.edu Wed Jun 1 13:50:08 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Wed, 01 Jun 2016 17:50:08 +0000
Subject: [scikit-learn] Fwd: ValueError

Hi Ruchika,
See if any of the suggestions in this issue (https://github.com/scikit-learn/scikit-learn/issues/6706) help; it seems to be the same problem. Try reinstalling scikit-learn as well.

Nelson Liu

From ruchika.work at gmail.com Wed Jun 1 13:55:51 2016
From: ruchika.work at gmail.com (Ruchika Nayyar)
Date: Wed, 1 Jun 2016 10:55:51 -0700
Subject: [scikit-learn] Fwd: ValueError

Hello Sebastian

Thanks for some insight. Here are some of my responses:

1) $ python -c 'import numpy; print(numpy.__version__)'
1.8.0rc1

2) $ python -c 'import numpy; print(scipy.__version__)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'scipy' is not defined

And I installed everything using pip install numpy/scikit-learn and so on. But when I tried

pip install --upgrade scipy
Requirement already up-to-date: scipy in /Library/Python/2.7/site-packages
Requirement already up-to-date: numpy>=1.6.2 in /Library/Python/2.7/site-packages (from scipy)

So I am not sure why it is not being defined.

Thanks,
Ruchika
From Dale.T.Smith at macys.com Wed Jun 1 13:59:02 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Wed, 1 Jun 2016 17:59:02 +0000
Subject: [scikit-learn] Fwd: ValueError

The best solution is to use either Canopy or Anaconda and avoid installing the PyData ecosystem manually. These distributions neatly avoid compile and version problems.

Dale Smith | Macy's Systems and Technology

From mail at sebastianraschka.com Wed Jun 1 14:00:58 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 1 Jun 2016 14:00:58 -0400
Subject: [scikit-learn] Fwd: ValueError

Sorry,

$ python -c 'import numpy; print(scipy.__version__)'

was a typo; it should be

$ python -c 'import scipy; print(scipy.__version__)'

However, I'd recommend looking at issue 6706, as Nelson Liu suggested, for further debugging (https://github.com/scikit-learn/scikit-learn/issues/6706)!

Like Maniteja suggested, it is likely due to "a mismatch between the numpy installed and the one scikit-learn is compiled with".

Best,
Sebastian
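[One way to gather all of these checks at once is a small loop that prints each package's version and the file it was imported from; a sketch, to be run with the same interpreter that raises the error:]

    # Print version and import location for each package. A path under
    # /System/Library (Apple's bundled copies) next to newer packages in
    # /Library/Python/2.7/site-packages is the classic sign of a mismatch.
    for name in ("numpy", "scipy", "sklearn"):
        try:
            mod = __import__(name)
            print("%s %s %s" % (name, getattr(mod, "__version__", "?"), mod.__file__))
        except Exception as exc:  # e.g. the ValueError from the original post
            print("%s failed to import: %s" % (name, exc))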
From maniteja.modesty067 at gmail.com Wed Jun 1 14:04:51 2016
From: maniteja.modesty067 at gmail.com (Maniteja Nandana)
Date: Wed, 1 Jun 2016 23:34:51 +0530
Subject: [scikit-learn] Fwd: ValueError

Hi,

As Nelson suggested, probably the version pip shows and the one imported in Python are not matching. Could you see the output of ``pip freeze``? In case that is not 1.8.0rc1, then the solution in that issue would probably be helpful here.

Regards,
Maniteja.

From ruchika.work at gmail.com Wed Jun 1 14:08:20 2016
From: ruchika.work at gmail.com (Ruchika Nayyar)
Date: Wed, 1 Jun 2016 11:08:20 -0700
Subject: [scikit-learn] Fwd: ValueError

I also copy-pasted blindly. Actually, I see that both numpy and scipy have older versions, not the new ones that I want. Let me look into the thread and do more debugging.

Thanks,
Ruchika
From matthew.brett at gmail.com Wed Jun 1 14:07:56 2016
From: matthew.brett at gmail.com (Matthew Brett)
Date: Wed, 1 Jun 2016 11:07:56 -0700
Subject: [scikit-learn] Fwd: ValueError

Hi,

I think you're using the system Python on the Mac. I'd really strongly recommend against that, because the system Python has its own numpy and scipy that aren't in the usual places, and this leads to great confusion when you try to upgrade numpy / scipy / matplotlib. I recommend Homebrew Python or Python.org Python instead:

https://github.com/MacPython/wiki/wiki/Which-Python

Cheers,
Matthew

From matthew.brett at gmail.com Wed Jun 1 14:12:55 2016
From: matthew.brett at gmail.com (Matthew Brett)
Date: Wed, 1 Jun 2016 11:12:55 -0700
Subject: [scikit-learn] Fwd: ValueError

> On Wed, Jun 1, 2016 at 11:08 AM, Ruchika Nayyar wrote:
> I also copy-pasted blindly. Actually, I see that both numpy and scipy have older versions, not the new ones that I want.

Yes, this is very likely because of the problem I pointed to with the system Python.

Cheers,
Matthew

From mail at sebastianraschka.com Wed Jun 1 14:17:53 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 1 Jun 2016 14:17:53 -0400
Subject: [scikit-learn] Fwd: ValueError

> I think you're using system Python on the Mac. I'd really strongly
> recommend against that, because system Python

Yeah, but I think the system Python doesn't come with NumPy and SciPy installed on a Mac? Personally, I am using Conda's distribution, not the system Python:

python --version
Python 3.5.1 :: Continuum Analytics, Inc.
From sshank at temple.edu Wed Jun 1 14:23:58 2016
From: sshank at temple.edu (STEPHEN D SHANK)
Date: Wed, 1 Jun 2016 14:23:58 -0400
Subject: [scikit-learn] Fwd: ValueError

I also recommend Anaconda. I have installed it on several variants of Linux as well as Mac and Windows, and it usually works right out of the box. Anytime it didn't, the issue was nothing that 5 minutes of Googling couldn't solve. I believe that scikit-learn is usually included, and if it isn't, the package manager (conda) has almost always served me well.

Stephen D. Shank, Ph. D.
From matthew.brett at gmail.com Wed Jun 1 14:39:03 2016
From: matthew.brett at gmail.com (Matthew Brett)
Date: Wed, 1 Jun 2016 11:39:03 -0700
Subject: [scikit-learn] Fwd: ValueError

> Yeah, but I think the system Python doesn't come with NumPy and SciPy installed on a Mac?

That's the entire problem - the system Python has its own private copy of numpy and scipy and matplotlib that are not in the usual sys.path places:

$ /usr/bin/python -c 'import numpy; print(numpy.__file__)'
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/__init__.pyc

Then, if you try to upgrade them with pip, the new packages are below the private copies in directory precedence, and the effect is that the upgrade is ignored.

In effect, the system Python is for the system; if you want to own your Python, you need to install another copy for yourself.

Cheers,
Matthew

From andrea.bravi at gmail.com Wed Jun 1 14:43:42 2016
From: andrea.bravi at gmail.com (Andrea Bravi)
Date: Wed, 1 Jun 2016 19:43:42 +0100
Subject: [scikit-learn] Fwd: ValueError

Hi guys,

I recommend using https://virtualenv.pypa.io to solve those issues!

Best regards,
Andrea
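[A sketch of that route, with a hypothetical environment path; it assumes virtualenv is already installed and, per the advice above, a non-system Python:]

    # Create an isolated environment so pip installs never collide with
    # Apple's bundled numpy/scipy copies.
    virtualenv ~/sklearn-venv
    source ~/sklearn-venv/bin/activate
    pip install numpy scipy scikit-learn
    python -c 'import numpy; print(numpy.__file__)'  # should point inside the venv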
From sshank at temple.edu Wed Jun 1 14:45:42 2016
From: sshank at temple.edu (STEPHEN D SHANK)
Date: Wed, 1 Jun 2016 14:45:42 -0400
Subject: [scikit-learn] Fwd: ValueError

Just to note, Anaconda also has its own method of managing environments.

From mail at sebastianraschka.com Wed Jun 1 14:50:45 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 1 Jun 2016 14:50:45 -0400
Subject: [scikit-learn] Fwd: ValueError

Oh, that's interesting, thanks for the info! I have never used the system Python and didn't know that it comes with its own NumPy & SciPy; I thought it was more bare-bones. (Sorry about the naive question, but is this a recent thing in Yosemite, or has it always been like this?)

> On Jun 1, 2016, at 2:39 PM, Matthew Brett wrote:
>
> That's the entire problem - the system Python has its own private copy of numpy and scipy and matplotlib that are not in the usual sys.path places.
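[For completeness, a conda sketch of the same idea; the environment name is hypothetical, and `source activate` was the activation syntax current at the time of this thread:]

    # Create an isolated conda env with its own numpy/scipy/scikit-learn,
    # independent of the system Python.
    conda create --name sklearn-env numpy scipy scikit-learn
    source activate sklearn-env
    python -c 'import sklearn; print(sklearn.__version__)'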
> > That's the entire problem - system Python has its own private copy of > numpy and scipy and matplotlib that are not in the usual sys.path > places: > > $ /usr/bin/python -c 'import numpy; print(numpy.__file__)' > > /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/__init__.pyc > > Then, if you try to upgrade them with pip, the new packages are below > the private copies in directory precedence, and the effect is that the > upgrade is ignored. > > In effect, system Python is for the system, if you want to own your > Python, you need to install another copy for yourself. > > Cheers, > > Matthew > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ruchika.work at gmail.com Wed Jun 1 14:51:47 2016 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Wed, 1 Jun 2016 11:51:47 -0700 Subject: [scikit-learn] Fwd: ValueError In-Reply-To: References: <04702825-F3B2-4CDC-8FDD-158F45BF8613@sebastianraschka.com> <304201F1-D046-4544-8896-B205DA89277B@sebastianraschka.com> Message-ID: Many thanks for your advice; I was able to sort out the ValueError due to numpy/scipy. I just removed the default versions that came with Python 2.7 and re-installed the latest ones. I did the same for matplotlib, but if I do this: import matplotlib.pyplot as plt It gives me this error: Learning BDTs with TMVA and Scikit /Library/Python/2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.') Traceback (most recent call last): File "bdt.py", line 12, in <module> import matplotlib.pyplot as plt File "/Library/Python/2.7/site-packages/matplotlib/pyplot.py", line 36, in <module> from matplotlib.figure import Figure, figaspect File "/Library/Python/2.7/site-packages/matplotlib/figure.py", line 40, in <module> from matplotlib.axes import Axes, SubplotBase, subplot_class_factory File "/Library/Python/2.7/site-packages/matplotlib/axes/__init__.py", line 4, in <module> from ._subplots import * File "/Library/Python/2.7/site-packages/matplotlib/axes/_subplots.py", line 10, in <module> from matplotlib.axes._axes import Axes File "/Library/Python/2.7/site-packages/matplotlib/axes/_axes.py", line 22, in <module> import matplotlib.dates as _ # <-registers a date unit converter File "/Library/Python/2.7/site-packages/matplotlib/dates.py", line 126, in <module> from dateutil.rrule import (rrule, MO, TU, WE, TH, FR, SA, SU, YEARLY, File "/Library/Python/2.7/site-packages/dateutil/rrule.py", line 19, in <module> from six.moves import _thread ImportError: cannot import name _thread Do you have any idea as to what this could be? Thanks, Ruchika ---------------------------------------- Dr Ruchika Nayyar, Post Doctoral Fellow for ATLAS Collaboration University of Arizona Arizona, USA. -------------------------------------------- On Wed, Jun 1, 2016 at 11:45 AM, STEPHEN D SHANK wrote: > Just to note, anaconda also has its own method of managing environments: > > On Wed, Jun 1, 2016 at 2:43 PM, Andrea Bravi > wrote: >> >> Hi guys, >> >> >> I recommend using https://virtualenv.pypa.io >> to solve those issues! >> >> >> Best regards, >> >> Andrea >> >> >> On Wednesday, 1 June 2016, Matthew Brett wrote: >> >>> On Wed, Jun 1, 2016 at 11:17 AM, Sebastian Raschka >>> wrote: >>> >> I think you're using system Python on the Mac.
I'd really strongly >>> >> recommend against that, because system Python >>> > >>> > Yeah, but I think that the system Python doesn?t come with NumPy and >>> SciPy installed on a Mac? >>> >>> That's the entire problem - system Python has its own private copy of >>> numpy and scipy and matplotlib that are not in the usual sys.path >>> places: >>> >>> $ /usr/bin/python -c 'import numpy; print(numpy.__file__)' >>> >>> /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/__init__.pyc >>> >>> Then, if you try to upgrade them with pip, the new packages are below >>> the private copies in directory precedence, and the effect is that the >>> upgrade is ignored. >>> >>> In effect, system Python is for the system, if you want to own your >>> Python, you need to install another copy for yourself. >>> >>> Cheers, >>> >>> Matthew >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > Stephen D. Shank, Ph. D. > Department of Biology, Center for Computational Genetics and Genomics > Temple University, Philadelphia, PA > BioLife 106F > sshank at temple.edu > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Wed Jun 1 14:56:06 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Wed, 1 Jun 2016 14:56:06 -0400 Subject: [scikit-learn] Fwd: ValueError In-Reply-To: References: <04702825-F3B2-4CDC-8FDD-158F45BF8613@sebastianraschka.com> <304201F1-D046-4544-8896-B205DA89277B@sebastianraschka.com> Message-ID: <6554EC64-7B39-408C-A5D5-5C8F3B1806D7@sebastianraschka.com> > I just removed the default one that came with python 2.7 As a general advice, I would never ever mess with the system Python; you could easily break something ? ?important? :P Why not creating a separate virtual environment to be on the safe side? > On Jun 1, 2016, at 2:51 PM, Ruchika Nayyar wrote: > > Much thanks for your advice I was able to sort out the issue with ValueError due to num/scipy > I just removed the default one that came with python 2.7 and re-installed latest version. > I did the same for matplotlib but if I do this > import matplotlib.pyplot as plt > > It gives me this error: > Learning BDTs with TMVA and Scikit > /Library/Python/2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. > warnings.warn('Matplotlib is building the font cache using fc-list. 
This may take a moment.') > Traceback (most recent call last): > File "bdt.py", line 12, in > import matplotlib.pyplot as plt > File "/Library/Python/2.7/site-packages/matplotlib/pyplot.py", line 36, in > from matplotlib.figure import Figure, figaspect > File "/Library/Python/2.7/site-packages/matplotlib/figure.py", line 40, in > from matplotlib.axes import Axes, SubplotBase, subplot_class_factory > File "/Library/Python/2.7/site-packages/matplotlib/axes/__init__.py", line 4, in > from ._subplots import * > File "/Library/Python/2.7/site-packages/matplotlib/axes/_subplots.py", line 10, in > from matplotlib.axes._axes import Axes > File "/Library/Python/2.7/site-packages/matplotlib/axes/_axes.py", line 22, in > import matplotlib.dates as _ # <-registers a date unit converter > File "/Library/Python/2.7/site-packages/matplotlib/dates.py", line 126, in > from dateutil.rrule import (rrule, MO, TU, WE, TH, FR, SA, SU, YEARLY, > File "/Library/Python/2.7/site-packages/dateutil/rrule.py", line 19, in > from six.moves import _thread > ImportError: cannot import name _thread > > > Do you have any idea as to what this could be? > > Thanks, > Ruchika > ---------------------------------------- > Dr Ruchika Nayyar, > Post Doctoral Fellow for ATLAS Collaboration > University of Arizona > Arizona, USA. > -------------------------------------------- > > On Wed, Jun 1, 2016 at 11:45 AM, STEPHEN D SHANK wrote: > Just to note, anaconda also has it's own method of managing environments: > > On Wed, Jun 1, 2016 at 2:43 PM, Andrea Bravi wrote: > > Hi guys, > > > I recommend using https://virtualenv.pypa.io to solve those issues! > > > Best regards, > > Andrea > > > On Wednesday, 1 June 2016, Matthew Brett wrote: > On Wed, Jun 1, 2016 at 11:17 AM, Sebastian Raschka > wrote: > >> I think you're using system Python on the Mac. I'd really strongly > >> recommend against that, because system Python > > > > Yeah, but I think that the system Python doesn?t come with NumPy and SciPy installed on a Mac? > > That's the entire problem - system Python has its own private copy of > numpy and scipy and matplotlib that are not in the usual sys.path > places: > > $ /usr/bin/python -c 'import numpy; print(numpy.__file__)' > /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/__init__.pyc > > Then, if you try to upgrade them with pip, the new packages are below > the private copies in directory precedence, and the effect is that the > upgrade is ignored. > > In effect, system Python is for the system, if you want to own your > Python, you need to install another copy for yourself. > > Cheers, > > Matthew > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Stephen D. Shank, Ph. D. 
> Department of Biology, Center for Computational Genetics and Genomics > Temple University, Philadelphia, PA > BioLife 106F > sshank at temple.edu > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ruchika.work at gmail.com Wed Jun 1 14:55:31 2016 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Wed, 1 Jun 2016 11:55:31 -0700 Subject: [scikit-learn] Fwd: ValueError In-Reply-To: References: <04702825-F3B2-4CDC-8FDD-158F45BF8613@sebastianraschka.com> <304201F1-D046-4544-8896-B205DA89277B@sebastianraschka.com> Message-ID: I am more naive than you ;) I have no clue! Thanks, Ruchika ---------------------------------------- Dr Ruchika Nayyar, Post Doctoral Fellow for ATLAS Collaboration University of Arizona Arizona, USA. -------------------------------------------- On Wed, Jun 1, 2016 at 11:50 AM, Sebastian Raschka < mail at sebastianraschka.com> wrote: > Oh that?s interesting, thanks for the info! I have never used the system > Python and didn?t know that it comes with its own NumPy & SciPy; I thought > it was more bare bones (sorry about the naive question, but is this a > recent thing in Yosemite or has it been always like this?) > > > On Jun 1, 2016, at 2:39 PM, Matthew Brett > wrote: > > > > On Wed, Jun 1, 2016 at 11:17 AM, Sebastian Raschka > > wrote: > >>> I think you're using system Python on the Mac. I'd really strongly > >>> recommend against that, because system Python > >> > >> Yeah, but I think that the system Python doesn?t come with NumPy and > SciPy installed on a Mac? > > > > That's the entire problem - system Python has its own private copy of > > numpy and scipy and matplotlib that are not in the usual sys.path > > places: > > > > $ /usr/bin/python -c 'import numpy; print(numpy.__file__)' > > > /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/__init__.pyc > > > > Then, if you try to upgrade them with pip, the new packages are below > > the private copies in directory precedence, and the effect is that the > > upgrade is ignored. > > > > In effect, system Python is for the system, if you want to own your > > Python, you need to install another copy for yourself. > > > > Cheers, > > > > Matthew > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Wed Jun 1 15:03:55 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 1 Jun 2016 12:03:55 -0700 Subject: [scikit-learn] Fwd: ValueError In-Reply-To: References: <04702825-F3B2-4CDC-8FDD-158F45BF8613@sebastianraschka.com> <304201F1-D046-4544-8896-B205DA89277B@sebastianraschka.com> Message-ID: On Wed, Jun 1, 2016 at 11:50 AM, Sebastian Raschka wrote: > Oh that?s interesting, thanks for the info! 
I have never used the system Python and didn?t know that it comes with its own NumPy & SciPy; I thought it was more bare bones (sorry about the naive question, but is this a recent thing in Yosemite or has it been always like this?) It's been like this since at least OSX 10.6, I haven't got anything earlier to test on. Yes, virtualenv is also a good solution, but I'd still prefer homebrew / python.org, because it allows you to install system packages without worrying about breaking things or running into path oddness, as here. Cheers, Matthew From ahowe42 at gmail.com Thu Jun 2 02:22:03 2016 From: ahowe42 at gmail.com (Andrew Howe) Date: Thu, 2 Jun 2016 09:22:03 +0300 Subject: [scikit-learn] Fwd: ValueError In-Reply-To: References: <04702825-F3B2-4CDC-8FDD-158F45BF8613@sebastianraschka.com> <304201F1-D046-4544-8896-B205DA89277B@sebastianraschka.com> Message-ID: I also strongly recommend the Anaconda distribution. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD Editor-in-Chief, European Journal of Mathematical Sciences Executive Editor, European Journal of Pure and Applied Mathematics www.andrewhowe.com http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Wed, Jun 1, 2016 at 9:23 PM, STEPHEN D SHANK wrote: > I also recommend Anaconda. I have installed on several variants of Linux > as well as Mac and Windows, and it usually works right out of the box. > Anytime it didn't, the issue was nothing that 5 minutes of Googling > couldn't solve. I believe that scikit-learn is usually included, and if it > isn't their package manager (conda) has almost always served me well. > > On Wed, Jun 1, 2016 at 2:17 PM, Sebastian Raschka < > mail at sebastianraschka.com> wrote: > >> > I think you're using system Python on the Mac. I'd really strongly >> > recommend against that, because system Python >> >> Yeah, but I think that the system Python doesn?t come with NumPy and >> SciPy installed on a Mac? >> Personally, I am using Conda?s dist., not the system Python >> >> python --version >> Python 3.5.1 :: Continuum Analytics, Inc. >> >> >> > On Jun 1, 2016, at 2:07 PM, Matthew Brett >> wrote: >> > >> > Hi, >> > >> > On Wed, Jun 1, 2016 at 11:00 AM, Sebastian Raschka >> > wrote: >> >> Sorry, >> >> >> >> $ python -c 'import numpy; print(scipy.__version__)? >> >> >> >> was a type, it should be >> >> >> >> $ python -c 'import scipy; print(scipy.__version__)? >> >> >> >> However, I?d recommend looking at the Issue 6706 as Nelson Liu >> suggested for further debugging ( >> https://github.com/scikit-learn/scikit-learn/issues/6706)! >> >> >> >> Like Maniteja suggested, it is likely due to ?a mismatch between numpy >> installed and the one scikit-learn is compiled with" >> > >> > I think you're using system Python on the Mac. I'd really strongly >> > recommend against that, because system Python has its own numpy and >> > scipy, that aren't in the usual places, and this leads to great >> > confusion when you try and upgrade numpy / scipy / matplotilb. 
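To make the diagnosis easy to reproduce, a small check along the lines of Matthew's one-liner shows which copies of the scientific stack an interpreter actually imports. A minimal sketch (the shell commands in the comments assume virtualenv is available, as Andrea suggests):

import numpy
import scipy
import matplotlib

# Paths under /System/Library/Frameworks/Python.framework/.../Extras
# are Apple's private copies, which sit ahead of pip-installed packages
# in directory precedence.
for mod in (numpy, scipy, matplotlib):
    print(mod.__name__, mod.__version__, mod.__file__)

# If the paths point into the system framework, a user-owned environment
# sidesteps the shadowing entirely, e.g.:
#   $ pip install --user virtualenv
#   $ virtualenv ~/sklearn-env && source ~/sklearn-env/bin/activate
#   (sklearn-env)$ pip install numpy scipy matplotlib scikit-learn

The six.moves ImportError in Ruchika's traceback is likely the same shadowing at work: an outdated system copy of six is found before the upgraded one, which a clean environment also avoids.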
I >> > recommend homebrew Python or Python.org Python instead: >> > >> > https://github.com/MacPython/wiki/wiki/Which-Python >> > >> > Cheers, >> > >> > Matthew >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > > Stephen D. Shank, Ph. D. > Department of Biology, Center for Computational Genetics and Genomics > Temple University, Philadelphia, PA > BioLife 106F > sshank at temple.edu > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaganadhg at gmail.com Thu Jun 2 15:24:54 2016 From: jaganadhg at gmail.com (JAGANADH G) Date: Thu, 2 Jun 2016 12:24:54 -0700 Subject: [scikit-learn] KMeans with cosine similarity Message-ID: Hi Team, I was trying to use cosine similarity with KMeans. The code which I used to achieve is available here. https://gist.github.com/jaganadhg/b3f6af86ad99bf6e9bb7be21e5baa1b5 Is it the right way to achieve the same ? I know that cosine is not directly supported in sklearn KMeans. But after skimming through the code I was thinking that this will work ;-) -- ********************************** JAGANADH G http://jaganadhg.in *ILUGCBE* http://ilugcbe.org.in -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 2 20:36:07 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 3 Jun 2016 10:36:07 +1000 Subject: [scikit-learn] KMeans with cosine similarity In-Reply-To: References: Message-ID: In short, no, monkey patching cosine_similarity in place of euclidean_distances will not work. See for instance this StackOverflow post: http://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric You could try out this Kernel KMeans implementation: https://github.com/scikit-learn/scikit-learn/pull/5483 On 3 June 2016 at 05:24, JAGANADH G wrote: > Hi Team, > I was trying to use cosine similarity with KMeans. The code which I used > to achieve is available here. > https://gist.github.com/jaganadhg/b3f6af86ad99bf6e9bb7be21e5baa1b5 > > Is it the right way to achieve the same ? I know that cosine is not > directly supported in sklearn KMeans. But after skimming through the code I > was thinking that this will work ;-) > > -- > ********************************** > JAGANADH G > http://jaganadhg.in > *ILUGCBE* > http://ilugcbe.org.in > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From blrstartuphire at gmail.com Fri Jun 3 05:18:44 2016 From: blrstartuphire at gmail.com (Startup Hire) Date: Fri, 3 Jun 2016 14:48:44 +0530 Subject: [scikit-learn] Fitting Lognormal Distribution In-Reply-To: References: Message-ID: Hi, Any one call help in above case? Regards, Sanant On Mon, May 30, 2016 at 4:48 PM, Startup Hire wrote: > Thanks to all the replies. > > I was able to write the intial code > > - Refer the charts below.. 
After the second red point, can I say that the > values of "BLUE" curve will always be higher than "GREEN" curve? > > - The ultimate objective is to find out when the values of blue curve > starts exceeding the values of green curve. > > > > > > Regards, Sanant[image: Inline image 1] > > On Fri, May 27, 2016 at 10:29 PM, Jacob Schreiber > wrote: > >> Another option is to use pomegranate >> which has probability >> distribution fitting with the same API as scikit-learn. You can see a tutorials >> here >> and >> it includes LogNormalDistribution, in addition to a lot of others. All >> distributions also have plotting methods. >> >> On Fri, May 27, 2016 at 6:53 AM, Warren Weckesser < >> warren.weckesser at gmail.com> wrote: >> >>> >>> >>> On Fri, May 27, 2016 at 2:08 AM, Startup Hire >>> wrote: >>> >>>> Hi, >>>> >>>> @ Warren: I was thinking of using federico method as its quite simple. >>>> I know the mu and sigma of log(values) and I need to plot a normal >>>> distribution based on that. Anything inaccurate in doing that? >>>> >>>> >>> >>> Getting mu and sigma from log(values) is fine. That's one of the three >>> methods (the one labeled "Explicit formula") that I included in this >>> answer: >>> http://stackoverflow.com/questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab/15632937#15632937 >>> >>> Warren >>> >>> >>> >>>> @ Sebastian: Thanks for your suggestion. I got to know more about >>>> powerlaw distributions. But, I dont think my values have a long tail. do >>>> you think it is still relevant? What are the potential applications of the >>>> same? >>>> >>>> Thanks & Regards, >>>> Sanant >>>> >>>> On Thu, May 26, 2016 at 7:50 PM, Sebastian Benthall < >>>> sbenthall at gmail.com> wrote: >>>> >>>>> You may also be interested in the 'powerlaw' Python package, which >>>>> detects the tail cutoff. >>>>> On May 26, 2016 5:46 AM, "Warren Weckesser" < >>>>> warren.weckesser at gmail.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Thu, May 26, 2016 at 2:08 AM, Startup Hire < >>>>>> blrstartuphire at gmail.com> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> Hope you are doing good. >>>>>>> >>>>>>> I am working on a project where I need to do the following things: >>>>>>> >>>>>>> 1. I need to fit a lognormal distribution to a set of values [I know >>>>>>> its lognormal by a simple XY scatter plot in excel] >>>>>>> >>>>>>> >>>>>> >>>>>> The probability distributions in scipy have a fit() method, and >>>>>> scipy.stats.lognorm implements the log-normal distribution ( >>>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) >>>>>> so you can use scipy.lognorm.fit(). See, for example, >>>>>> http://stackoverflow.com/questions/26406056/a-lognormal-distribution-in-python >>>>>> or http://stackoverflow.com/ >>>>>> >>>>>> /questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab >>>>>> >>>>>> Warren >>>>>> >>>>>> >>>>>> >>>>>>> 2. I need to find the intersection of the lognormal distribution so >>>>>>> that I can decide cut-off values based on that. >>>>>>> >>>>>>> >>>>>>> Can you guide me on (1) and (2) can be achieved in python? 
>>>>>>> >>>>>>> Regards, >>>>>>> Sanant >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Curve 1 and Curve 2.png Type: image/png Size: 23093 bytes Desc: not available URL: From michael.eickenberg at gmail.com Fri Jun 3 05:38:41 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Fri, 3 Jun 2016 11:38:41 +0200 Subject: [scikit-learn] Fitting Lognormal Distribution In-Reply-To: References: Message-ID: probably, especially if they are normalised. you have the formulas for those, right? then you can say it for sure. just take the log on both sides. start by plotting the log of both of those distributions and you willprobably see already On Friday, June 3, 2016, Startup Hire wrote: > Hi, > > Any one call help in above case? > > Regards, > Sanant > > On Mon, May 30, 2016 at 4:48 PM, Startup Hire > wrote: > >> Thanks to all the replies. >> >> I was able to write the intial code >> >> - Refer the charts below.. After the second red point, can I say that the >> values of "BLUE" curve will always be higher than "GREEN" curve? >> >> - The ultimate objective is to find out when the values of blue >> curve starts exceeding the values of green curve. >> >> >> >> >> >> Regards, Sanant[image: Inline image 1] >> >> On Fri, May 27, 2016 at 10:29 PM, Jacob Schreiber < >> jmschreiber91 at gmail.com >> > wrote: >> >>> Another option is to use pomegranate >>> which has probability >>> distribution fitting with the same API as scikit-learn. You can see a tutorials >>> here >>> and >>> it includes LogNormalDistribution, in addition to a lot of others. All >>> distributions also have plotting methods. >>> >>> On Fri, May 27, 2016 at 6:53 AM, Warren Weckesser < >>> warren.weckesser at gmail.com >>> > wrote: >>> >>>> >>>> >>>> On Fri, May 27, 2016 at 2:08 AM, Startup Hire >>> > wrote: >>>> >>>>> Hi, >>>>> >>>>> @ Warren: I was thinking of using federico method as its quite simple. >>>>> I know the mu and sigma of log(values) and I need to plot a normal >>>>> distribution based on that. Anything inaccurate in doing that? >>>>> >>>>> >>>> >>>> Getting mu and sigma from log(values) is fine. 
That's one of the three >>>> methods (the one labeled "Explicit formula") that I included in this >>>> answer: >>>> http://stackoverflow.com/questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab/15632937#15632937 >>>> >>>> Warren >>>> >>>> >>>> >>>>> @ Sebastian: Thanks for your suggestion. I got to know more about >>>>> powerlaw distributions. But, I dont think my values have a long tail. do >>>>> you think it is still relevant? What are the potential applications of the >>>>> same? >>>>> >>>>> Thanks & Regards, >>>>> Sanant >>>>> >>>>> On Thu, May 26, 2016 at 7:50 PM, Sebastian Benthall < >>>>> sbenthall at gmail.com >>>>> > wrote: >>>>> >>>>>> You may also be interested in the 'powerlaw' Python package, which >>>>>> detects the tail cutoff. >>>>>> On May 26, 2016 5:46 AM, "Warren Weckesser" < >>>>>> warren.weckesser at gmail.com >>>>>> > wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, May 26, 2016 at 2:08 AM, Startup Hire < >>>>>>> blrstartuphire at gmail.com >>>>>>> > wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> Hope you are doing good. >>>>>>>> >>>>>>>> I am working on a project where I need to do the following things: >>>>>>>> >>>>>>>> 1. I need to fit a lognormal distribution to a set of values [I >>>>>>>> know its lognormal by a simple XY scatter plot in excel] >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> The probability distributions in scipy have a fit() method, and >>>>>>> scipy.stats.lognorm implements the log-normal distribution ( >>>>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) >>>>>>> so you can use scipy.lognorm.fit(). See, for example, >>>>>>> http://stackoverflow.com/questions/26406056/a-lognormal-distribution-in-python >>>>>>> or http://stackoverflow.com/ >>>>>>> >>>>>>> /questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab >>>>>>> >>>>>>> Warren >>>>>>> >>>>>>> >>>>>>> >>>>>>>> 2. I need to find the intersection of the lognormal distribution so >>>>>>>> that I can decide cut-off values based on that. >>>>>>>> >>>>>>>> >>>>>>>> Can you guide me on (1) and (2) can be achieved in python? >>>>>>>> >>>>>>>> Regards, >>>>>>>> Sanant >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Curve 1 and Curve 2.png Type: image/png Size: 23093 bytes Desc: not available URL: From blrstartuphire at gmail.com Fri Jun 3 06:02:24 2016 From: blrstartuphire at gmail.com (Startup Hire) Date: Fri, 3 Jun 2016 15:32:24 +0530 Subject: [scikit-learn] Fitting Lognormal Distribution In-Reply-To: References: Message-ID: The above normal distribution is plotted by taking log of the values.. So, you mean to say I can take exp(values) and see whether the criteria is satisfied after the meeting point. Regards, Sanant On Fri, Jun 3, 2016 at 3:08 PM, Michael Eickenberg < michael.eickenberg at gmail.com> wrote: > probably, especially if they are normalised. > you have the formulas for those, right? then you can say it for sure. just > take the log on both sides. start by plotting the log of both of those > distributions and you willprobably see already > > > On Friday, June 3, 2016, Startup Hire wrote: > >> Hi, >> >> Any one call help in above case? >> >> Regards, >> Sanant >> >> On Mon, May 30, 2016 at 4:48 PM, Startup Hire >> wrote: >> >>> Thanks to all the replies. >>> >>> I was able to write the intial code >>> >>> - Refer the charts below.. After the second red point, can I say that >>> the values of "BLUE" curve will always be higher than "GREEN" curve? >>> >>> - The ultimate objective is to find out when the values of blue >>> curve starts exceeding the values of green curve. >>> >>> >>> >>> >>> >>> Regards, Sanant[image: Inline image 1] >>> >>> On Fri, May 27, 2016 at 10:29 PM, Jacob Schreiber < >>> jmschreiber91 at gmail.com> wrote: >>> >>>> Another option is to use pomegranate >>>> which has probability >>>> distribution fitting with the same API as scikit-learn. You can see a tutorials >>>> here >>>> and >>>> it includes LogNormalDistribution, in addition to a lot of others. All >>>> distributions also have plotting methods. >>>> >>>> On Fri, May 27, 2016 at 6:53 AM, Warren Weckesser < >>>> warren.weckesser at gmail.com> wrote: >>>> >>>>> >>>>> >>>>> On Fri, May 27, 2016 at 2:08 AM, Startup Hire < >>>>> blrstartuphire at gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> @ Warren: I was thinking of using federico method as its quite >>>>>> simple. I know the mu and sigma of log(values) and I need to plot a normal >>>>>> distribution based on that. Anything inaccurate in doing that? >>>>>> >>>>>> >>>>> >>>>> Getting mu and sigma from log(values) is fine. That's one of the >>>>> three methods (the one labeled "Explicit formula") that I included in this >>>>> answer: >>>>> http://stackoverflow.com/questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab/15632937#15632937 >>>>> >>>>> Warren >>>>> >>>>> >>>>> >>>>>> @ Sebastian: Thanks for your suggestion. I got to know more about >>>>>> powerlaw distributions. But, I dont think my values have a long tail. do >>>>>> you think it is still relevant? What are the potential applications of the >>>>>> same? >>>>>> >>>>>> Thanks & Regards, >>>>>> Sanant >>>>>> >>>>>> On Thu, May 26, 2016 at 7:50 PM, Sebastian Benthall < >>>>>> sbenthall at gmail.com> wrote: >>>>>> >>>>>>> You may also be interested in the 'powerlaw' Python package, which >>>>>>> detects the tail cutoff. >>>>>>> On May 26, 2016 5:46 AM, "Warren Weckesser" < >>>>>>> warren.weckesser at gmail.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, May 26, 2016 at 2:08 AM, Startup Hire < >>>>>>>> blrstartuphire at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> Hope you are doing good. 
>>>>>>>>> >>>>>>>>> I am working on a project where I need to do the following things: >>>>>>>>> >>>>>>>>> 1. I need to fit a lognormal distribution to a set of values [I >>>>>>>>> know its lognormal by a simple XY scatter plot in excel] >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> The probability distributions in scipy have a fit() method, and >>>>>>>> scipy.stats.lognorm implements the log-normal distribution ( >>>>>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) >>>>>>>> so you can use scipy.lognorm.fit(). See, for example, >>>>>>>> http://stackoverflow.com/questions/26406056/a-lognormal-distribution-in-python >>>>>>>> or http://stackoverflow.com/ >>>>>>>> >>>>>>>> /questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab >>>>>>>> >>>>>>>> Warren >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 2. I need to find the intersection of the lognormal distribution >>>>>>>>> so that I can decide cut-off values based on that. >>>>>>>>> >>>>>>>>> >>>>>>>>> Can you guide me on (1) and (2) can be achieved in python? >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Sanant >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> scikit-learn mailing list >>>>>>>>> scikit-learn at python.org >>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Curve 1 and Curve 2.png Type: image/png Size: 23093 bytes Desc: not available URL: From michael.eickenberg at gmail.com Fri Jun 3 08:06:56 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Fri, 3 Jun 2016 14:06:56 +0200 Subject: [scikit-learn] Fitting Lognormal Distribution In-Reply-To: References: Message-ID: no, I mean to say log(yaxis) On Fri, Jun 3, 2016 at 12:02 PM, Startup Hire wrote: > The above normal distribution is plotted by taking log of the values.. > > So, you mean to say I can take exp(values) and see whether the criteria is > satisfied after the meeting point. > > Regards, > Sanant > > On Fri, Jun 3, 2016 at 3:08 PM, Michael Eickenberg < > michael.eickenberg at gmail.com> wrote: > >> probably, especially if they are normalised. >> you have the formulas for those, right? then you can say it for sure. 
>> just take the log on both sides. start by plotting the log of both of those >> distributions and you willprobably see already >> >> >> On Friday, June 3, 2016, Startup Hire wrote: >> >>> Hi, >>> >>> Any one call help in above case? >>> >>> Regards, >>> Sanant >>> >>> On Mon, May 30, 2016 at 4:48 PM, Startup Hire >>> wrote: >>> >>>> Thanks to all the replies. >>>> >>>> I was able to write the intial code >>>> >>>> - Refer the charts below.. After the second red point, can I say that >>>> the values of "BLUE" curve will always be higher than "GREEN" curve? >>>> >>>> - The ultimate objective is to find out when the values of blue >>>> curve starts exceeding the values of green curve. >>>> >>>> >>>> >>>> >>>> >>>> Regards, Sanant[image: Inline image 1] >>>> >>>> On Fri, May 27, 2016 at 10:29 PM, Jacob Schreiber < >>>> jmschreiber91 at gmail.com> wrote: >>>> >>>>> Another option is to use pomegranate >>>>> which has probability >>>>> distribution fitting with the same API as scikit-learn. You can see a tutorials >>>>> here >>>>> and >>>>> it includes LogNormalDistribution, in addition to a lot of others. All >>>>> distributions also have plotting methods. >>>>> >>>>> On Fri, May 27, 2016 at 6:53 AM, Warren Weckesser < >>>>> warren.weckesser at gmail.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, May 27, 2016 at 2:08 AM, Startup Hire < >>>>>> blrstartuphire at gmail.com> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> @ Warren: I was thinking of using federico method as its quite >>>>>>> simple. I know the mu and sigma of log(values) and I need to plot a normal >>>>>>> distribution based on that. Anything inaccurate in doing that? >>>>>>> >>>>>>> >>>>>> >>>>>> Getting mu and sigma from log(values) is fine. That's one of the >>>>>> three methods (the one labeled "Explicit formula") that I included in this >>>>>> answer: >>>>>> http://stackoverflow.com/questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab/15632937#15632937 >>>>>> >>>>>> Warren >>>>>> >>>>>> >>>>>> >>>>>>> @ Sebastian: Thanks for your suggestion. I got to know more about >>>>>>> powerlaw distributions. But, I dont think my values have a long tail. do >>>>>>> you think it is still relevant? What are the potential applications of the >>>>>>> same? >>>>>>> >>>>>>> Thanks & Regards, >>>>>>> Sanant >>>>>>> >>>>>>> On Thu, May 26, 2016 at 7:50 PM, Sebastian Benthall < >>>>>>> sbenthall at gmail.com> wrote: >>>>>>> >>>>>>>> You may also be interested in the 'powerlaw' Python package, which >>>>>>>> detects the tail cutoff. >>>>>>>> On May 26, 2016 5:46 AM, "Warren Weckesser" < >>>>>>>> warren.weckesser at gmail.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, May 26, 2016 at 2:08 AM, Startup Hire < >>>>>>>>> blrstartuphire at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> Hope you are doing good. >>>>>>>>>> >>>>>>>>>> I am working on a project where I need to do the following things: >>>>>>>>>> >>>>>>>>>> 1. I need to fit a lognormal distribution to a set of values [I >>>>>>>>>> know its lognormal by a simple XY scatter plot in excel] >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> The probability distributions in scipy have a fit() method, and >>>>>>>>> scipy.stats.lognorm implements the log-normal distribution ( >>>>>>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) >>>>>>>>> so you can use scipy.lognorm.fit(). 
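Putting the pieces of this thread together, here is a rough sketch of the fit-and-intersect computation, with made-up samples standing in for the blue and green curves (all names and numbers are invented for illustration). Fitting gives mu and sigma of log(values), and since both fitted curves are then normal pdfs in log space, their crossing points are the roots of a quadratic:

import numpy as np
from scipy import stats

# Hypothetical stand-ins for the two observed curves.
blue = np.random.lognormal(mean=1.0, sigma=0.8, size=1000)
green = np.random.lognormal(mean=0.5, sigma=0.4, size=1000)

# "Explicit formula" fit: mu and sigma of log(values).
mu_b, s_b = np.log(blue).mean(), np.log(blue).std()
mu_g, s_g = np.log(green).mean(), np.log(green).std()

# Equating the two normal log-pdfs and expanding gives a quadratic
# a*x**2 + b*x + c = 0 in x = log(value):
a = 1.0 / s_g**2 - 1.0 / s_b**2
b = 2.0 * (mu_b / s_b**2 - mu_g / s_g**2)
c = mu_g**2 / s_g**2 - mu_b**2 / s_b**2 + 2.0 * np.log(s_g / s_b)
crossings = np.sort(np.roots([a, b, c]).real)  # up to two intersections

# Past the last crossing, check which curve stays on top.
x = crossings[-1] + 1.0
blue_on_top = stats.norm(mu_b, s_b).pdf(x) > stats.norm(mu_g, s_g).pdf(x)

Past the last root the pdf with the larger sigma dominates, so the check above answers whether blue stays above green from the second red point onwards; np.exp(crossings) maps the cut-offs back to the original value axis.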
See, for example, >>>>>>>>> http://stackoverflow.com/questions/26406056/a-lognormal-distribution-in-python >>>>>>>>> or http://stackoverflow.com/ >>>>>>>>> >>>>>>>>> /questions/15630647/fitting-lognormal-distribution-using-scipy-vs-matlab >>>>>>>>> >>>>>>>>> Warren >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> 2. I need to find the intersection of the lognormal distribution >>>>>>>>>> so that I can decide cut-off values based on that. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Can you guide me on (1) and (2) can be achieved in python? >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Sanant >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> scikit-learn mailing list >>>>>>>>>> scikit-learn at python.org >>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> scikit-learn mailing list >>>>>>>>> scikit-learn at python.org >>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>> >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Curve 1 and Curve 2.png Type: image/png Size: 23093 bytes Desc: not available URL: From mamunbabu2001 at gmail.com Fri Jun 3 11:54:29 2016 From: mamunbabu2001 at gmail.com (Mamun Rashid) Date: Fri, 3 Jun 2016 16:54:29 +0100 Subject: [scikit-learn] Probability values from OneClassSVM In-Reply-To: References: Message-ID: <0F9B0A34-B222-49C8-AA82-363F30547348@gmail.com> Hi everyone, I am running OneClassSVM method. It seems unlike the normal SVC, which has an option to return probability, this method does not have any option to retrieve probability values. I would like to draw some performance metric such as the ROC and Precision Recall about the performance of the classifier. 
Thanks, Mamun From goix.nicolas at gmail.com Fri Jun 3 12:16:51 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Fri, 3 Jun 2016 12:16:51 -0400 Subject: [scikit-learn] Probability values from OneClassSVM In-Reply-To: <0F9B0A34-B222-49C8-AA82-363F30547348@gmail.com> References: <0F9B0A34-B222-49C8-AA82-363F30547348@gmail.com> Message-ID: Hi Mamun, You can draw ROC and PR curves using the OCSVM decision_function Nicolas 2016-06-03 11:54 GMT-04:00 Mamun Rashid : > Hi everyone, > I am running OneClassSVM method. It seems unlike the normal SVC, which has > an option to return probability, this method does not have any option to > retrieve probability values. > I would like to draw some performance metric such as the ROC and Precision > Recall about the performance of the classifier. > > Thanks, > Mamun > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Mon Jun 6 08:19:00 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Mon, 6 Jun 2016 14:19:00 +0200 Subject: [scikit-learn] memory efficient feature extraction Message-ID: <57556A34.5000904@gmail.com> Dear all, I was wondering if somebody could advise on the best way to generate and store large sparse feature sets that do not fit in memory? In particular, I have the following workflow, Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR array on disk -> Training a classifier -> Predictions where the generated feature set is too large to fit in RAM, however the classifier training can be done in one step (as it uses only certain rows of the CSR array) and the prediction can be split in several steps, all of which fit in memory. Since the training can be performed in one step, I'm not looking for incremental learning out-of-core approaches, and saving features to disk for later processing is definitely useful. For instance, if it was possible to save the output of the HashingVectorizer to a single file on disk (using e.g. joblib.dump) then load this file as a memory map (using e.g. joblib.load(.., mmap_mode='r')) everything would work great. Due to memory constraints this cannot be done directly, and the best case scenario is applying HashingVectorizer on chunks of the dataset, which produces a series of sparse CSR arrays on disk. Then, - concatenation of these arrays into a single CSR array appears to be non-trivial given the memory constraints (e.g. scipy.sparse.vstack transforms all arrays to COO sparse representation internally). - I was not able to find an abstraction layer that would allow representing these sparse arrays as a single array. For instance, dask could allow doing this for dense arrays ( http://dask.pydata.org/en/latest/array-stack.html ), however support for sparse arrays is only planned at this point ( https://github.com/dask/dask/issues/174 ). Finally, it is not possible to pre-allocate the full array on disk in advance (and access it as a memory map) because we don't know the number of non-zero elements in the sparse array before running the feature extraction. Of course, it is possible to overcome all these difficulties by using a machine with more memory, but my point is rather to have a memory-efficient workflow.
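On the concatenation point specifically, CSR chunks with the same number of columns can be stacked through their raw data / indices / indptr arrays, which avoids the COO round-trip of a generic scipy.sparse.vstack. A minimal sketch (it assumes the concatenated arrays fit wherever you put them; for the fully out-of-core variant the three target arrays could be pre-sized np.memmap files after a first pass over the chunks to sum their nnz):

import numpy as np
import scipy.sparse as sp

def stack_csr_chunks(chunks):
    # All chunks must be CSR with an identical number of columns.
    n_cols = chunks[0].shape[1]
    data = np.concatenate([c.data for c in chunks])
    indices = np.concatenate([c.indices for c in chunks])
    # Row pointers: keep the first chunk's indptr, then append each
    # following indptr (minus its leading zero) shifted by the running
    # non-zero count.
    indptr_parts = [chunks[0].indptr]
    nnz = chunks[0].nnz
    for c in chunks[1:]:
        indptr_parts.append(c.indptr[1:] + nnz)
        nnz += c.nnz
    indptr = np.concatenate(indptr_parts)
    n_rows = sum(c.shape[0] for c in chunks)
    return sp.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))

The same bookkeeping works chunk by chunk against pre-allocated memmaps, since each chunk only appends to data and indices and extends indptr by a constant offset.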
I would really appreciate any advice on this and would be happy to contribute to a project in the scikit-learn environment aiming to address similar issues, Thank you, Best, -- Roman From joel.nothman at gmail.com Mon Jun 6 08:29:49 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 6 Jun 2016 22:29:49 +1000 Subject: [scikit-learn] memory efficient feature extraction In-Reply-To: <57556A34.5000904@gmail.com> References: <57556A34.5000904@gmail.com> Message-ID: > - concatenation of theses arrays into a single CSR array appears to be > non-tivial given the memory constraints (e.g. scipy.sparse.vstack > transforms all arrays to COO sparse representation internally). There is a fast path for stacking a series of CSR matrices. On 6 June 2016 at 22:19, Roman Yurchak wrote: > Dear all, > > I was wondering if somebody could advise on the best way for > generating/storing large sparse feature sets that do not fit in memory? > In particular, I have the following workflow, > > Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR > array on disk -> Training a classifier -> Predictions > > where the the generated feature set is too large to fit in RAM, however > the classifier training can be done in one step (as it uses only certain > rows of the CSR array) and the prediction can be split in several steps, > all of which fit in memory. Since the training can be performed in one > step, I'm not looking for incremental learning out-of-core approaches > and saving features to disk for later processing is definitely useful. > > For instance, if it was possible to save the output of the > HashingVectorizer to a single file on disk (using e.g. joblib.dump) then > load this file as a memory map (using e.g. joblib.load(.., > mmap_mode='r')) everything would work great. Due to memory constraints > this cannot be done directly, and the best case scenario is applying > HashingVectorizer on chunks of the dataset, which produces a series of > sparse CSR arrays on disk. Then, > - concatenation of theses arrays into a single CSR array appears to be > non-tivial given the memory constraints (e.g. scipy.sparse.vstack > transforms all arrays to COO sparse representation internally). > - I was not able to find an abstraction layer that would allow to > represent these sparse arrays as a single array. For instance, dask > could allow to do this for dense arrays ( > http://dask.pydata.org/en/latest/array-stack.html ), however support for > sparse arrays is only planned at this point ( > https://github.com/dask/dask/issues/174 ). > Finally, it is not possible to pre-allocate the full array on disk in > advance (and access it as a memory map) because we don't know the number > of non-zero elements in the sparse array before running the feature > extraction. > > Of course, it is possible to overcome all these difficulties by using > a machine with more memory, but my point is rather to have a memory > efficient workflow. > > I would really appreciate any advice on this and would be happy to > contribute to a project in the scikit-learn environment aiming to > address similar issues, > > Thank you, > Best, > -- > Roman > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rth.yurchak at gmail.com Mon Jun 6 18:27:57 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 7 Jun 2016 00:27:57 +0200 Subject: [scikit-learn] memory efficient feature extraction In-Reply-To: References: <57556A34.5000904@gmail.com> Message-ID: <5755F8ED.30106@gmail.com> Hi Joel, thanks for your response. On 06/06/16 14:29, Joel Nothman wrote: > - concatenation of theses arrays into a single CSR array appears to be > non-tivial given the memory constraints (e.g. scipy.sparse.vstack > transforms all arrays to COO sparse representation internally). > > There is a fast path for stacking a series of CSR matrices. Could you elaborate a bit more? When the final array is larger than the available memory? Do you mean something along the lines of, 1. Load all arrays of the series as memory maps, and calculate the expected final array shape 2. Allocate the `data`, `indices` and `indptr` arrays on disk using either numpy memory map or HDF5 3. Recalculate `indptr` for each array in the series and fill the 3 resulting arrays 4. Make sure that we can open these files as a scipy CSR array with the ability to load only a subset of rows to memory? I'm just wondering if there is a more standard storage solution in the scikit-learn environment that could be used efficiently with a stateless feature extractor (HashingVectorizer) , Cheers, -- Roman From mamunbabu2001 at gmail.com Mon Jun 6 19:21:15 2016 From: mamunbabu2001 at gmail.com (Mamun Rashid) Date: Tue, 7 Jun 2016 00:21:15 +0100 Subject: [scikit-learn] Probability values from OneClassSVM In-Reply-To: References: <0F9B0A34-B222-49C8-AA82-363F30547348@gmail.com> Message-ID: <71F3752C-90AA-46DD-8DE6-6F72E327611C@gmail.com> Hi Nicolas, Thanks for your reply. Apology for the naive question. I can see from the example that we can plot the decision boundary using the decision function. Not sure how can I extract the ROC and PRC metric from there. A small example would greatly help. Thanks, Mamun > On 3 Jun 2016, at 17:16, Nicolas Goix wrote: > > Hi Mamun, > You can draw ROC and PR curves using the OCSVM decision_function > Nicolas > > 2016-06-03 11:54 GMT-04:00 Mamun Rashid >: > Hi everyone, > I am running OneClassSVM method. It seems unlike the normal SVC, which has an option to return probability, this method does not have any option to retrieve probability values. > I would like to draw some performance metric such as the ROC and Precision Recall about the performance of the classifier. > > Thanks, > Mamun > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From goix.nicolas at gmail.com Mon Jun 6 20:11:51 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Mon, 6 Jun 2016 20:11:51 -0400 Subject: [scikit-learn] Probability values from OneClassSVM In-Reply-To: <71F3752C-90AA-46DD-8DE6-6F72E327611C@gmail.com> References: <0F9B0A34-B222-49C8-AA82-363F30547348@gmail.com> <71F3752C-90AA-46DD-8DE6-6F72E327611C@gmail.com> Message-ID: Hi Mamun, from sklearn.metrics import roc_curve, auc from sklearn.svm import OneClassSVM ocsvm = OneClassSVM().fit(X_train) scoring = - ocsvm.decision_function(X_test) # the lower, the more normal fpr, tpr, thresholds = roc_curve(y_test, scoring) AUC = auc(fpr, tpr) HTH Nicolas 2016-06-06 19:21 GMT-04:00 Mamun Rashid : > Hi Nicolas, > Thanks for your reply. Apology for the naive question. > I can see from the example that we can plot the decision boundary using > the decision function. > Not sure how can I extract the ROC and PRC metric from there. A small > example would greatly help. > > Thanks, > Mamun > > On 3 Jun 2016, at 17:16, Nicolas Goix wrote: > > Hi Mamun, > You can draw ROC and PR curves using the OCSVM decision_function > Nicolas > > 2016-06-03 11:54 GMT-04:00 Mamun Rashid : > >> Hi everyone, >> I am running OneClassSVM method. It seems unlike the normal SVC, which >> has an option to return probability, this method does not have any option >> to retrieve probability values. >> I would like to draw some performance metric such as the ROC and >> Precision Recall about the performance of the classifier. >> >> Thanks, >> Mamun >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ivaylo.petkantchin at gmail.com Tue Jun 7 01:09:17 2016 From: ivaylo.petkantchin at gmail.com (Ivo Petkantchin) Date: Tue, 7 Jun 2016 07:09:17 +0200 Subject: [scikit-learn] NB-SVM Implementation Message-ID: Hello, For a university project I worked on a Sentiment Analysis challenge (Movie Reviews) and implemented a version of NBSVM as described in this paper: http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf If I am not wrong there is no NBSVM class in scikit-learn. That is why I would like to contribute by coding a NB matrix class if the work is not done by someone else already. Best regards, Ivaylo Petkantchin -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Jun 7 04:11:46 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 7 Jun 2016 10:11:46 +0200 Subject: [scikit-learn] NB-SVM Implementation In-Reply-To: References: Message-ID: I think it could be implemented as a preprocessing step: this is the approach followed by: https://github.com/ryankiros/skip-thoughts/blob/master/eval_classification.py Note that in that case LogisticRegression is used as the final classifier instead of a squared hinge loss SVM but that should not change much in practice. 
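To make the preprocessing step concrete, the heart of the paper is the naive Bayes log-count ratio r, used to rescale the features before fitting a linear model. A toy sketch (corpus, labels and variable names are invented for illustration; binarized counts as in the paper, with LogisticRegression standing in for the squared hinge loss SVM, as above):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["good great film", "awful boring film",
        "great cast great plot", "boring awful acting"]
y = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative

# Binarized term counts, as in Wang & Manning (2012).
X = CountVectorizer(binary=True).fit_transform(docs)

# Smoothed per-class feature counts and their log ratio.
alpha = 1.0
p = alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()
q = alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()
r = np.log((p / p.sum()) / (q / q.sum()))

# Scale each feature by r, then fit the linear classifier on X_nb.
X_nb = X.multiply(r).tocsr()
clf = LogisticRegression().fit(X_nb, y)

Wrapping the computation of r in fit() and the multiplication in transform() gives exactly the kind of Transformer discussed below, so the whole thing can sit in a Pipeline in front of any linear estimator.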
If you want to make this approach scikit-learn compatible (to work with the Pipeline and sklearn's model selection tools for instance) be sure to implement the Transformer API as documented here: http://scikit-learn.org/dev/developers/contributing.html#apis-of-scikit-learn-objects Read the rest of the contributions guide: http://scikit-learn.org/dev/developers NBSVM is quite recent and might not strictly follow the conditions for inclusion as stated in: http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published It already has 163 citations though: https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1710642630990759287 As this is a really strong baseline and the model is not complex and should blend well within the scikit-learn API I would be +1 for inclusion in sklearn. -- Olivier From ivaylo.petkantchin at gmail.com Wed Jun 8 01:07:00 2016 From: ivaylo.petkantchin at gmail.com (Ivo Petkantchin) Date: Wed, 8 Jun 2016 07:07:00 +0200 Subject: [scikit-learn] NB-SVM Implementation In-Reply-To: References: Message-ID: Thank you for your answer ! I will start working on all the requirements for the scikit learn API. 2016-06-07 10:11 GMT+02:00 Olivier Grisel : > I think it could be implemented as a preprocessing step: this is the > approach followed by: > > https://github.com/ryankiros/skip-thoughts/blob/master/eval_classification.py > > Note that in that case LogisticRegression is used as the final > classifier instead of a squared hinge loss SVM but that should not > change much in practice. > > If you want to make this approach scikit-learn compatible (to work > with the Pipeline and sklearn's model selection tools for instance) be > sure to implement the Transformer API as documented here: > > > http://scikit-learn.org/dev/developers/contributing.html#apis-of-scikit-learn-objects > > Read the rest of the contributions guide: > > http://scikit-learn.org/dev/developers > > NBSVM is quite recent and might not strictly follow the conditions for > inclusion as stated in: > > > http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published > > It already has 163 citations though: > > https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1710642630990759287 > > As this is a really strong baseline and the model is not complex and > should blend well within the scikit-learn API I would be +1 for > inclusion in sklearn. > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Wed Jun 8 04:43:05 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Wed, 08 Jun 2016 08:43:05 +0000 Subject: [scikit-learn] EuroSciPy 2016 Call for Papers Extended Message-ID: Hi everyone, The call for contributions (talks, posters, sprints) is still open until June 24th. EuroSciPy 2016 takes place in Erlangen, Germany, from the 23 to the 27 of August and consists of two days of tutorials (beginner and advanced tracks) and two days of conference representing many fields of science, with a focus on Python tools for science. A day of sprints follows (sprints TBA). The keynote speakers are Ga?l Varoquaux (you might have heard of him) and Abby Cabunoc Mayes and we can expect a rich tutorial and scientific program! 
Videos from previous years are available at https://www.youtube.com/playlist?list=PLYx7XA2nY5GeQCCugyvtnHMVLdhYlrRxH and https://www.youtube.com/playlist?list=PLYx7XA2nY5Gcpabmu61kKcToLz0FapmHu

We are particularly eager to receive proposals from newcomers. EuroSciPy is a very welcoming conference, and we are very curious to hear how you use Python/machine learning in your everyday research.

Visit us, register and submit an abstract on our website!
https://www.euroscipy.org/2016/

SciPythonic regards,
The EuroSciPy 2016 team

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From donkey-hotei at cryptolab.net Thu Jun 9 08:32:49 2016
From: donkey-hotei at cryptolab.net (donkey-hotei at cryptolab.net)
Date: Thu, 09 Jun 2016 12:32:49 +0000
Subject: [scikit-learn] partial_fit implementation for IsolationForest
In-Reply-To: References: Message-ID: 

hi nicolas,
excuse me, didn't mean to drop this thread for so long.

> There is a paper from the same authors as iforest but for streaming data: http://ijcai.org/Proceedings/11/Papers/254.pdf
>
> For now it is not cited enough (24) to satisfy the sklearn requirements. Waiting for more citations, this could be a nice addition to sklearn-contrib.

agreed, I started on a weak implementation of hstree but it is not scikit-learn compatible, let's see what happens... it would be nice to see some guidance here, maybe a new splitter will have to be added?

> Otherwise, we could imagine extending iforest to streaming data by building new trees when data come (and removing the oldest ones), prediction still being based on the average depth of the forest. I'm not sure this heuristic could be merged in scikit-learn, since it is not based on well-cited papers. At the same time, it is a natural and simple extension of iforest to streaming data...
>
> Any opinion on it?

It is, as I thought, a simple extension - my first naive approach was to use the 'warm_start' attribute of the BaseBagging parent class to preserve older estimators, and then, in the 'partial_fit' method, have a loop which pops off some number of the oldest estimators before calling the original 'fit' method again on incoming data, adding new estimators to the ensemble. We run into the problem of concept drift. Is this the way you'd implement this? If not, how would you approach it?

thanks so much for reading,
isaak

From goix.nicolas at gmail.com Thu Jun 9 12:58:52 2016
From: goix.nicolas at gmail.com (Nicolas Goix)
Date: Thu, 9 Jun 2016 12:58:52 -0400
Subject: [scikit-learn] partial_fit implementation for IsolationForest
In-Reply-To: References: Message-ID: 

Hi Isaak,

There is a good review of methods to do online random forests here:

https://arxiv.org/pdf/1302.4853.pdf

In fact, it turns out that the method of having a "window" of trees is not the best way to do it. Usually the trees have to be grown at the same time as the data arrive, see

http://lrs.icg.tugraz.at/pubs/saffari_olcv_09.pdf

Adapting the ensembles API to online learning seems like hard work. But you can open a PR to discuss it.

Nicolas

On 9 Jun 2016 9:06 am, wrote:

> hi nicolas,
> excuse me, didn't mean to drop this thread for so long.
>
>> There is a paper from the same authors as iforest but for streaming data: http://ijcai.org/Proceedings/11/Papers/254.pdf
>>
>> For now it is not cited enough (24) to satisfy the sklearn requirements. Waiting for more citations, this could be a nice addition to sklearn-contrib.
>
> agreed, I started on a weak implementation of hstree but it is not scikit-learn compatible, let's see what happens... it would be nice to see some guidance here, maybe a new splitter will have to be added?
>
>> Otherwise, we could imagine extending iforest to streaming data by building new trees when data come (and removing the oldest ones), prediction still being based on the average depth of the forest. I'm not sure this heuristic could be merged in scikit-learn, since it is not based on well-cited papers. At the same time, it is a natural and simple extension of iforest to streaming data...
>>
>> Any opinion on it?
>
> It is, as I thought, a simple extension - my first naive approach was to use the 'warm_start' attribute of the BaseBagging parent class to preserve older estimators, and then, in the 'partial_fit' method, have a loop which pops off some number of the oldest estimators before calling the original 'fit' method again on incoming data, adding new estimators to the ensemble.
> We run into the problem of concept drift. Is this the way you'd implement this? If not, how would you approach it?
>
> thanks so much for reading,
> isaak
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From olivier.grisel at ensta.org Fri Jun 10 09:31:58 2016
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 10 Jun 2016 15:31:58 +0200
Subject: [scikit-learn] partial_fit implementation for IsolationForest
In-Reply-To: References: Message-ID: 

> However, at present, IsolationForest only fits data in batch even while it may be well suited to incremental on-line learning since one could subsample recent history and older estimators can be dropped progressively.

What you describe is quite different from what sklearn models typically do with partial_fit. partial_fit is more about out-of-core / streaming fitting rather than true online learning with explicit forgetting.

In particular, what you suggest would not accept calling partial_fit with very small chunks (e.g. from tens to a hundred samples at a time), because that would not be enough to develop deep isolation trees and would harm the performance of the resulting isolation forest.

If the problem is true online learning (tracking a stream of training data with expected shifts in its distribution), I think it's better to devise a dedicated API that does not try to mimic the scikit-learn API (for this specific part). There will typically have to be an additional hyperparameter to control how much the model should remember about old samples.

If the problem is more about out-of-core fitting, then partial_fit is suitable, but the trees should grow and get reorganized progressively (as pointed out by others in previous comments).

BTW, I would be curious to know more about the kind of anomaly detection problem where you found IsolationForests to work well.
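PS: for the out-of-core reading of the problem, the warm_start idea you describe would look roughly like the following sketch. This is illustrative only: IsolationForest does not expose warm_start in its constructor, so this pokes at the attribute inherited from BaseBagging, and the chunk variables and number of trees per batch are arbitrary made-up names.

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=20).fit(X_first_chunk)
iforest.warm_start = True  # inherited from BaseBagging, not a public option
for X_chunk in chunks:  # subsequent batches of training data
    iforest.n_estimators += 20  # request 20 additional trees
    iforest.fit(X_chunk)  # only the new trees are fitted, on this chunk

Forgetting old data would then mean trimming iforest.estimators_ (and the matching entries of estimators_features_) by hand, which is exactly the kind of heuristic that would need the dedicated forgetting hyperparameter mentioned above.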
-- 
Olivier

From donkey-hotei at cryptolab.net Fri Jun 10 09:19:50 2016
From: donkey-hotei at cryptolab.net (donkey-hotei at cryptolab.net)
Date: Fri, 10 Jun 2016 15:19:50 +0200
Subject: [scikit-learn] partial_fit implementation for IsolationForest
In-Reply-To: References: Message-ID: <09930c407153e12fd56f2972d49cef9b@cryptolab.net>

nicolas,

> There is a good review of methods to do online random forests here:
>
> https://arxiv.org/pdf/1302.4853.pdf
>
> In fact, it turns out that the method of having a "window" of trees is not the best way to do it. Usually the trees have to be grown at the same time as the data arrive, see
>
> http://lrs.icg.tugraz.at/pubs/saffari_olcv_09.pdf
>
> Adapting the ensembles API to online learning seems like hard work. But you can open a PR to discuss it.

Thanks a lot for the papers and info. I'll open a PR at some point and see what happens..

ty,
isaak

From gupta.gaurav0125 at gmail.com Sun Jun 12 13:24:40 2016
From: gupta.gaurav0125 at gmail.com (Gaurav gupta)
Date: Sun, 12 Jun 2016 13:24:40 -0400
Subject: [scikit-learn] how to create and execute a machine learning models in Java/JVM based application (in production) using Python
Message-ID: 

Hi All,

Could you please guide me on how to *create and execute* machine learning/statistical models (regression, decision trees, k-means clustering, naive Bayes, scorecard/linear/logistic regression, etc., and GBM, GLM) in a *Java/JVM based application* (in production).

We have an ETL sort of Java-based product where one can do most of the data preparation steps for machine learning, like data ingestion from JDBC, files, HDFS, NoSQL, etc., plus joins and aggregations (which are required for feature engineering), and now we want to add analytics capabilities using machine learning/statistical modeling.

Right now, we are using JPMML-Evaluator to score the models created in PMML format using R and Python (and KNIME), but it needs three separate and unconnected steps:

1- A first step for data preparation in our Java/JVM application, saving the sampled (training and test) data in a CSV file or in a DB.

2- Create a machine learning model in R and Python (and KNIME) and export it in PMML 4.2 format.

3- Import/deploy the PMML in our Java-based application and use JPMML-Evaluator to execute it in production.

I am sure it's a common problem in machine learning, as Java is generally preferred over Python or R in production. Could you suggest the better approach(es) to *create* as well as *execute* a Python/scikit-learn based machine learning model in a JVM based application?

What are your thoughts on achieving steps #2 and #3 more seamlessly in a JVM based application, without compromising *performance and usability*?

1- Call a Java program which internally calls the Python scikit-learn script (under the hood) to create a model in PMML, and then use JPMML-Evaluator. To the user it will look like a single JVM based application (better usability). I am not sure what the limitations and shortcomings of using PMML are, as not all features are supported in jpmml-sklearn.

2- Call a Java program which internally calls the Python script, doing the model creation as well as execution in an external Python environment, and serialize the model and the results to a file/CSV or an in-memory DB (or cache, like Hazelcast), from where the parent Java application will fetch the results. I found in my research that I can't use Jython for executing scikit-learn models.

3- Can I use Jep (Embed Python in Java) to embed CPython in the JVM?
Has anybody tried it for scikit-learn models?

Alternatively, I should explore using Mahout or Weka, Java-based machine learning libraries, in my JVM based application.

(I need to support both Windows and non-Windows platforms.)

I am also exploring H2O.ai, which is Java based. Has anybody tried it?

Regards
Gaurav

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From yafc18 at gmail.com Mon Jun 13 07:45:42 2016
From: yafc18 at gmail.com (颜发才)
Date: Mon, 13 Jun 2016 19:45:42 +0800
Subject: [scikit-learn] how to create and execute a machine learning models in Java/JVM based application (in production) using Python
In-Reply-To: References: Message-ID: 

How about Spark? It contains some common machine learning algorithms and supports a Java API.

On Jun 13, 2016 01:26, "Gaurav gupta" wrote:

> Hi All,
>
> Could you please guide me on how to *create and execute* machine learning/statistical models (regression, decision trees, k-means clustering, naive Bayes, scorecard/linear/logistic regression, etc., and GBM, GLM) in a *Java/JVM based application* (in production).
>
> We have an ETL sort of Java-based product where one can do most of the data preparation steps for machine learning, like data ingestion from JDBC, files, HDFS, NoSQL, etc., plus joins and aggregations (which are required for feature engineering), and now we want to add analytics capabilities using machine learning/statistical modeling.
>
> Right now, we are using JPMML-Evaluator to score the models created in PMML format using R and Python (and KNIME), but it needs three separate and unconnected steps:
>
> 1- A first step for data preparation in our Java/JVM application, saving the sampled (training and test) data in a CSV file or in a DB.
>
> 2- Create a machine learning model in R and Python (and KNIME) and export it in PMML 4.2 format.
>
> 3- Import/deploy the PMML in our Java-based application and use JPMML-Evaluator to execute it in production.
>
> I am sure it's a common problem in machine learning, as Java is generally preferred over Python or R in production. Could you suggest the better approach(es) to *create* as well as *execute* a Python/scikit-learn based machine learning model in a JVM based application?
>
> What are your thoughts on achieving steps #2 and #3 more seamlessly in a JVM based application, without compromising *performance and usability*?
>
> 1- Call a Java program which internally calls the Python scikit-learn script (under the hood) to create a model in PMML, and then use JPMML-Evaluator. To the user it will look like a single JVM based application (better usability). I am not sure what the limitations and shortcomings of using PMML are, as not all features are supported in jpmml-sklearn.
>
> 2- Call a Java program which internally calls the Python script, doing the model creation as well as execution in an external Python environment, and serialize the model and the results to a file/CSV or an in-memory DB (or cache, like Hazelcast), from where the parent Java application will fetch the results. I found in my research that I can't use Jython for executing scikit-learn models.
>
> 3- Can I use Jep (Embed Python in Java) to embed CPython in the JVM? Has anybody tried it for scikit-learn models?
>
> Alternatively, I should explore using Mahout or Weka, Java-based machine learning libraries, in my JVM based application.
> (I need to support both Windows and non-Windows platforms.)
>
> I am also exploring H2O.ai, which is Java based. Has anybody tried it?
>
> Regards
> Gaurav
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com Mon Jun 13 21:36:52 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 14 Jun 2016 11:36:52 +1000
Subject: [scikit-learn] The culture of commit squashing
Message-ID: 

For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations.

Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist.

I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer.

I think squashing is more trouble than it's worth, except where:
* there are embarrassingly many minor commits in a PR
* squashing first makes a rebase easier because of concurrent changes to the codebase
* otherwise for cosmetic reasons only when there is low reviewing activity on the PR

While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great.

Do others agree that a culture of amending commits in the ordinary case is counterproductive?

(And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jmschreiber91 at gmail.com Mon Jun 13 22:40:47 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Mon, 13 Jun 2016 19:40:47 -0700
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: 

My research work involves frequently contributing small changes. I like to keep these around as a record of what I've done, until I've finished with that part of the code. However, I also hate having large numbers of commits (frequently I can commit 50+ times a day without much substantive progress). To combine these, usually I will avoid squashing commits in a branch until right before I merge it. This way you can review everything which has been done until you're finished with that branch, but also avoid having a large number of trivial commits. In this case, only after you've been given MRG +2 would you squash the PR. That would have a negative side effect of preventing the second reviewer from quickly merging the branch, though.

What are your thoughts on that, Joel?

On Mon, Jun 13, 2016 at 6:36 PM, Joel Nothman wrote:

> For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations.
> Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist.
>
> I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer.
>
> I think squashing is more trouble than it's worth, except where:
> * there are embarrassingly many minor commits in a PR
> * squashing first makes a rebase easier because of concurrent changes to the codebase
> * otherwise for cosmetic reasons only when there is low reviewing activity on the PR
>
> While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great.
>
> Do others agree that a culture of amending commits in the ordinary case is counterproductive?
>
> (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From basilbeirouti at gmail.com Mon Jun 13 22:44:36 2016
From: basilbeirouti at gmail.com (Basil Beirouti)
Date: Mon, 13 Jun 2016 21:44:36 -0500
Subject: [scikit-learn] Adding BM25 relevance function to sklearn.feature_extraction.text
Message-ID: 

Hello all,

You can use sklearn.feature_extraction.text.TfidfVectorizer to learn a corpus of documents and rank them in order of relevance to a new, previously unseen query.

BM25 works in a similar manner to TfidfVectorizer, but it is more complex and is considered one of the most successful information retrieval algorithms.

I currently have code that implements BM25 quite efficiently to learn a corpus of documents, and I want to modify/port it to align with the fit-transform framework of sklearn. I think it could fit neatly into the current codebase.

Questions:
1.) Would this be a desirable feature?
2.) Any advice for how to proceed with this? Things to watch out for?

Any and all advice is welcome.

Thanks!
Basil

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com Mon Jun 13 22:47:48 2016
From: t3kcit at gmail.com (Andy)
Date: Mon, 13 Jun 2016 22:47:48 -0400
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: <575F7054.5010708@gmail.com>

I agree that it adds some annoying overhead. For me, one of the main motivations is to make cherry picks for bugfix releases easier. It's very hard to cherry pick things that are spread out over many commits, and it's hard to find the relevant bug fixes among hundreds of minor commits. This really mostly impacts me, Olivier, and maybe Gaël (not sure if you did that at some point, too).

It is somewhat related to our practice of merging by rebase, which I think we mostly stopped. When merging by rebase, you don't have merge commits, so finding out which commit belongs to which changeset is hard. I think the rebasing is actually not a great idea for that reason (and also it breaks the nice links from commits to PRs which GitHub provides these days). If we do standard merges and don't squash, I guess the cherry picking is still possible, though somewhat harder.
It's not that common a use-case, though, and I guess we can remove the hassle of the rebase if it's too much of a nuisance.

On 06/13/2016 09:36 PM, Joel Nothman wrote:
> For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations.
>
> Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist.
>
> I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer.
>
> I think squashing is more trouble than it's worth, except where:
> * there are embarrassingly many minor commits in a PR
> * squashing first makes a rebase easier because of concurrent changes to the codebase
> * otherwise for cosmetic reasons only when there is low reviewing activity on the PR
>
> While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great.
>
> Do others agree that a culture of amending commits in the ordinary case is counterproductive?
>
> (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com Mon Jun 13 22:48:47 2016
From: t3kcit at gmail.com (Andy)
Date: Mon, 13 Jun 2016 22:48:47 -0400
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: <575F708F.8050902@gmail.com>

On 06/13/2016 09:36 PM, Joel Nothman wrote:
> (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
I really see no reason why someone should squash something before it is ready to be merged (as Jacob suggested).

From mail at sebastianraschka.com Mon Jun 13 22:53:56 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 13 Jun 2016 22:53:56 -0400
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: 

Hi, Joel,

in my opinion, it really depends on the particular case, but in general I am pro squashing, that is, when it happens at the very end. I also agree that squashing and force pushing while there's still a review going on is clearly disruptive.

Say there's a new estimator being added. This often comes with an insane number of commits:

- first take on EstimatorX
- fix xyz in EstimatorX
- address test case issue x
- add additional test case to test edge case x
- add code example for estimator x
- fix typo in documentation for estimator x
- ...

So, once everything looks fine to the reviewers, everyone gave their okay, and the CI tests pass, I think there's nothing against summarizing it to a single commit:

- implement EstimatorX

In my opinion, it helps tracking down code in the commit history in the long run, but that's just my personal opinion.
Best,
Sebastian

> On Jun 13, 2016, at 9:36 PM, Joel Nothman wrote:
>
> For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations.
>
> Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist.
>
> I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer.
>
> I think squashing is more trouble than it's worth, except where:
> * there are embarrassingly many minor commits in a PR
> * squashing first makes a rebase easier because of concurrent changes to the codebase
> * otherwise for cosmetic reasons only when there is low reviewing activity on the PR
>
> While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great.
>
> Do others agree that a culture of amending commits in the ordinary case is counterproductive?
>
> (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From joel.nothman at gmail.com Mon Jun 13 23:02:57 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 14 Jun 2016 13:02:57 +1000
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: 

My concern is that people are responding to being asked to squash on one PR by squashing during development on the next (as if merge were always imminent). I want that to stop. Is part of the solution to stop squashing, or make the person merging always perform the squash?

On 14 June 2016 at 12:53, Sebastian Raschka wrote:

> Hi, Joel,
> in my opinion, it really depends on the particular case, but in general I am pro squashing, that is, when it happens at the very end. I also agree that squashing and force pushing while there's still a review going on is clearly disruptive.
> Say there's a new estimator being added. This often comes with an insane number of commits:
>
> - first take on EstimatorX
> - fix xyz in EstimatorX
> - address test case issue x
> - add additional test case to test edge case x
> - add code example for estimator x
> - fix typo in documentation for estimator x
> - ...
>
> So, once everything looks fine to the reviewers, everyone gave their okay, and the CI tests pass, I think there's nothing against summarizing it to a single commit:
>
> - implement EstimatorX
>
> In my opinion, it helps tracking down code in the commit history in the long run, but that's just my personal opinion.
>
> Best,
> Sebastian
>
> > On Jun 13, 2016, at 9:36 PM, Joel Nothman wrote:
> >
> > For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations.
> > Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist.
> >
> > I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer.
> >
> > I think squashing is more trouble than it's worth, except where:
> > * there are embarrassingly many minor commits in a PR
> > * squashing first makes a rebase easier because of concurrent changes to the codebase
> > * otherwise for cosmetic reasons only when there is low reviewing activity on the PR
> >
> > While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great.
> >
> > Do others agree that a culture of amending commits in the ordinary case is counterproductive?
> >
> > (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jni.soma at gmail.com Mon Jun 13 23:40:16 2016
From: jni.soma at gmail.com (Juan Nunez-Iglesias)
Date: Tue, 14 Jun 2016 03:40:16 +0000
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: 

I think the main idea behind commit squashes is to make sure that every *commit* passes testing, rather than only merge commits. This is important because there's no way to tell git bisect to only look at merge commits. So when you are doing a git bisect to hunt down a regression or bug, it is very annoying to have a huge string of commits that don't even build, which turns a binary search into a linear search. (Some people are concerned that big commit strings aren't proportional to effort, but I think they are certainly *more* proportional than a squash.)

So I think that's a worthwhile goal, but one that I think GitHub doesn't support very well. There's a whole demographic of developers that hate GitHub for code review and prefer Gerrit, and I never really understood what they were talking about until I read this blog post a couple of months ago:

https://www.beepsend.com/2016/04/05/abandoning-gitflow-github-favour-gerrit/

If you've never used Gerrit, it's very much worth reading in full. Sometimes you don't know what you're missing out on until you see it. I hope that increased pressure from the community will push GitHub to improve their tooling.

Finally, in the meantime, Stéfan van der Walt and I have started toying with the idea of using Reviewable, which appears to implement most of Gerrit's features with the advantage that it integrates with GitHub:

https://reviewable.io/

I hope this helps!

Juan.

On Mon, Jun 13, 2016 at 11:04 PM Joel Nothman wrote:

> My concern is that people are responding to being asked to squash on one PR by squashing during development on the next (as if merge were always imminent). I want that to stop.
> Is part of the solution to stop squashing, or make the person merging always perform the squash?
>
> On 14 June 2016 at 12:53, Sebastian Raschka wrote:
>
>> Hi, Joel,
>> in my opinion, it really depends on the particular case, but in general I am pro squashing, that is, when it happens at the very end. I also agree that squashing and force pushing while there's still a review going on is clearly disruptive.
>> Say there's a new estimator being added. This often comes with an insane number of commits:
>>
>> - first take on EstimatorX
>> - fix xyz in EstimatorX
>> - address test case issue x
>> - add additional test case to test edge case x
>> - add code example for estimator x
>> - fix typo in documentation for estimator x
>> - ...
>>
>> So, once everything looks fine to the reviewers, everyone gave their okay, and the CI tests pass, I think there's nothing against summarizing it to a single commit:
>>
>> - implement EstimatorX
>>
>> In my opinion, it helps tracking down code in the commit history in the long run, but that's just my personal opinion.
>>
>> Best,
>> Sebastian
>>
>> > On Jun 13, 2016, at 9:36 PM, Joel Nothman wrote:
>> >
>> > For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations.
>> >
>> > Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist.
>> >
>> > I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer.
>> >
>> > I think squashing is more trouble than it's worth, except where:
>> > * there are embarrassingly many minor commits in a PR
>> > * squashing first makes a rebase easier because of concurrent changes to the codebase
>> > * otherwise for cosmetic reasons only when there is low reviewing activity on the PR
>> >
>> > While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great.
>> >
>> > Do others agree that a culture of amending commits in the ordinary case is counterproductive?
>> >
>> > (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com Tue Jun 14 00:00:55 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 14 Jun 2016 14:00:55 +1000
Subject: [scikit-learn] Adding BM25 relevance function to sklearn.feature_extraction.text
In-Reply-To: References: Message-ID: 

Hi Basil,

Scikit-learn isn't a library for information retrieval. The question is: how useful is the BM25 feature reweighting in a machine learning context?

This has been previously discussed at https://www.mail-archive.com/scikit-learn-general at lists.sourceforge.net/msg11353.html. The whole thread is worth reading.

Despite enthusiasm, it never got as far as a pull request. And still the major burden is showing that this transformation helps for classification/clustering.

Joel

On 14 June 2016 at 12:44, Basil Beirouti wrote:

> Hello all,
>
> You can use sklearn.feature_extraction.text.TfidfVectorizer to learn a corpus of documents and rank them in order of relevance to a new, previously unseen query.
>
> BM25 works in a similar manner to TfidfVectorizer, but it is more complex and is considered one of the most successful information retrieval algorithms.
>
> I currently have code that implements BM25 quite efficiently to learn a corpus of documents, and I want to modify/port it to align with the fit-transform framework of sklearn. I think it could fit neatly into the current codebase.
>
> Questions:
> 1.) Would this be a desirable feature?
> 2.) Any advice for how to proceed with this? Things to watch out for?
>
> Any and all advice is welcome.
>
> Thanks!
> Basil
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From matthieu.brucher at gmail.com Tue Jun 14 02:06:43 2016
From: matthieu.brucher at gmail.com (Matthieu Brucher)
Date: Tue, 14 Jun 2016 08:06:43 +0200
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: 

I don't even think that squashing them before the merge is actually sound. You will still need the history of why something happened several years down the road (and rebasing actually has a similar issue). This bit me quite often (having just one big commit to analyze after a merge from an ancient VCS). Git was created to keep the history when merging; why would we explicitly remove knowledge?

Just my 2 cents.

2016-06-14 4:40 GMT+02:00 Jacob Schreiber :

> My research work involves frequently contributing small changes. I like to keep these around as a record of what I've done, until I've finished with that part of the code. However, I also hate having large numbers of commits (frequently I can commit 50+ times a day without much substantive progress). To combine these, usually I will avoid squashing commits in a branch until right before I merge it. This way you can review everything which has been done until you're finished with that branch, but also avoid having a large number of trivial commits. In this case, only after you've been given MRG +2 would you squash the PR. That would have a negative side effect of preventing the second reviewer from quickly merging the branch, though.
>
> What are your thoughts on that, Joel?
>
> On Mon, Jun 13, 2016 at 6:36 PM, Joel Nothman wrote:
>
>> For the last few years, there's been a notion that we should squash PRs down to a single commit before merging.
Squashing can give a cleaner commit >> history, and avoid overrepresentation of minor work given silly commit >> count metrics used by Github and others. I'm not sure if there are other >> motivations. >> >> Recently I've seen several contributors amending commits and >> force-pushing changes. I find this disruptive to the reviewing process in a >> number of ways (links are broken; what's changed is hard to discern, when >> it could have been indicated in a commit message and diff; etc.). I have >> had to ask these several users to cease and desist. >> >> I also find that performing the squash can be unnecessary overhead either >> for the merger or the PR developer. >> >> I think squashing is more trouble than it's worth, except where: >> * there are embarrassingly many minor commits in a PR >> * squashing first makes a rebase easier because of concurrent changes to >> the codebase >> * otherwise for cosmetic reasons only when there is low reviewing >> activity on the PR >> >> While squashing is far from the slowest part of our review process, being >> able to hit the merge button and move on would be great. >> >> Do others agree that a culture of amending commits in the ordinary case >> is counterproductive? >> >> (And apologies for wasting your time on such a silly issue, but I'm sick >> of clicking links in emails to find the commit's disappeared.) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Information System Engineer, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.duprelatour at orange.fr Tue Jun 14 05:34:50 2016 From: tom.duprelatour at orange.fr (Tom DLT) Date: Tue, 14 Jun 2016 11:34:50 +0200 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: References: Message-ID: We could stop squashing during development, and use the *new* Squash-and-Merge button on GitHub. What do you think? Tom 2016-06-14 8:06 GMT+02:00 Matthieu Brucher : > I don't even think that squashing them before the merge is actually sound. > You will still need the history of why something happened several years > down the road (and rebasing actually has a similar issue). This bit me > quite often (having just one big commit to analyze after a merge from > ancient VCS). Git was created to keep the history when merging, why would > we explicitly remove knowledge? > Just my 2 cents. > > 2016-06-14 4:40 GMT+02:00 Jacob Schreiber : > >> My research work involves frequently contributing small changes. I like >> to keep these around as a record of what I've done, until I've finished >> with that part of the code. However, I also hate having large numbers of >> commits (frequently can commit 50+ times a day without much substantitve >> progress). To combine these, usually I will avoid squashing commits in a >> branch until right before I merge it. This way you can review everything >> which has been done until you're finished with that branch, but also avoid >> having a large number of trivial commits. In this case, only after you've >> been given MRG +2 would you squash the PR. 
That would have a negative side >> effect of preventing the second reviewer from quickly merging the branch, >> though. >> >> What are your thoughts on that, Joel? >> >> On Mon, Jun 13, 2016 at 6:36 PM, Joel Nothman >> wrote: >> >>> For the last few years, there's been a notion that we should squash PRs >>> down to a single commit before merging. Squashing can give a cleaner commit >>> history, and avoid overrepresentation of minor work given silly commit >>> count metrics used by Github and others. I'm not sure if there are other >>> motivations. >>> >>> Recently I've seen several contributors amending commits and >>> force-pushing changes. I find this disruptive to the reviewing process in a >>> number of ways (links are broken; what's changed is hard to discern, when >>> it could have been indicated in a commit message and diff; etc.). I have >>> had to ask these several users to cease and desist. >>> >>> I also find that performing the squash can be unnecessary overhead >>> either for the merger or the PR developer. >>> >>> I think squashing is more trouble than it's worth, except where: >>> * there are embarrassingly many minor commits in a PR >>> * squashing first makes a rebase easier because of concurrent changes to >>> the codebase >>> * otherwise for cosmetic reasons only when there is low reviewing >>> activity on the PR >>> >>> While squashing is far from the slowest part of our review process, >>> being able to hit the merge button and move on would be great. >>> >>> Do others agree that a culture of amending commits in the ordinary case >>> is counterproductive? >>> >>> (And apologies for wasting your time on such a silly issue, but I'm sick >>> of clicking links in emails to find the commit's disappeared.) >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Information System Engineer, Ph.D. > Blog: http://blog.audio-tk.com/ > LinkedIn: http://www.linkedin.com/in/matthieubrucher > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at telecom-paristech.fr Tue Jun 14 05:51:58 2016 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Tue, 14 Jun 2016 11:51:58 +0200 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: References: Message-ID: > We could stop squashing during development, and use the new Squash-and-Merge > button on GitHub. > What do you think? +1 the reason I see for squashing during dev is to avoid killing the browser when reviewing. It really rarely happens though. A From joel.nothman at gmail.com Tue Jun 14 06:58:09 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 14 Jun 2016 20:58:09 +1000 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: References: Message-ID: Sounds good to me. Thank goodness someone reads the documentation! On 14 June 2016 at 19:51, Alexandre Gramfort < alexandre.gramfort at telecom-paristech.fr> wrote: > > We could stop squashing during development, and use the new > Squash-and-Merge > > button on GitHub. > > What do you think? 
> > +1 > > the reason I see for squashing during dev is to avoid killing the > browser when reviewing. It really rarely happens though. > > A > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Tue Jun 14 10:56:33 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Tue, 14 Jun 2016 10:56:33 -0400 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: References: Message-ID: <895B2B8F-BA61-4A4A-92C3-1C2BE78B11AE@sebastianraschka.com> Oh wow, that looks like a neat feature, didn?t know about this, thanks for sharing! (And I would be in favor of this) > On Jun 14, 2016, at 5:34 AM, Tom DLT wrote: > > We could stop squashing during development, and use the new Squash-and-Merge button on GitHub. > What do you think? > Tom > > > 2016-06-14 8:06 GMT+02:00 Matthieu Brucher : > I don't even think that squashing them before the merge is actually sound. You will still need the history of why something happened several years down the road (and rebasing actually has a similar issue). This bit me quite often (having just one big commit to analyze after a merge from ancient VCS). Git was created to keep the history when merging, why would we explicitly remove knowledge? > Just my 2 cents. > > 2016-06-14 4:40 GMT+02:00 Jacob Schreiber : > My research work involves frequently contributing small changes. I like to keep these around as a record of what I've done, until I've finished with that part of the code. However, I also hate having large numbers of commits (frequently can commit 50+ times a day without much substantitve progress). To combine these, usually I will avoid squashing commits in a branch until right before I merge it. This way you can review everything which has been done until you're finished with that branch, but also avoid having a large number of trivial commits. In this case, only after you've been given MRG +2 would you squash the PR. That would have a negative side effect of preventing the second reviewer from quickly merging the branch, though. > > What are your thoughts on that, Joel? > > On Mon, Jun 13, 2016 at 6:36 PM, Joel Nothman wrote: > For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivations. > > Recently I've seen several contributors amending commits and force-pushing changes. I find this disruptive to the reviewing process in a number of ways (links are broken; what's changed is hard to discern, when it could have been indicated in a commit message and diff; etc.). I have had to ask these several users to cease and desist. > > I also find that performing the squash can be unnecessary overhead either for the merger or the PR developer. > > I think squashing is more trouble than it's worth, except where: > * there are embarrassingly many minor commits in a PR > * squashing first makes a rebase easier because of concurrent changes to the codebase > * otherwise for cosmetic reasons only when there is low reviewing activity on the PR > > While squashing is far from the slowest part of our review process, being able to hit the merge button and move on would be great. 
> > Do others agree that a culture of amending commits in the ordinary case is counterproductive?
> >
> > (And apologies for wasting your time on such a silly issue, but I'm sick of clicking links in emails to find the commit's disappeared.)
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn

From basilbeirouti at gmail.com Tue Jun 14 11:47:35 2016
From: basilbeirouti at gmail.com (Basil Beirouti)
Date: Tue, 14 Jun 2016 10:47:35 -0500
Subject: [scikit-learn] Adding BM25 relevance function
Message-ID: 

Hi Joel,

Thanks for your response and for digging up that archived thread; it gives me a lot of clarity.

I see your point about BM25, but I think that in most cases where TFIDF makes sense, BM25 makes sense as well, though it could be "overkill". Consider that TFIDF does not produce normalized results either: if BM25 requires dimensionality reduction (e.g. using LSA), so too would TFIDF. The term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that the term frequency in a document, and the document frequency in the corpus, can vary considerably across a broad range of values. You could even say TFIDF and BM25 are the same equation, except that BM25 has a few additional hyperparameters (b and k).

So is the advantage that BM25 provides for large, diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (in a supervised setting, preferably) and I can plug in BM25 in place of TFIDF and see how it compares. Here are some I found:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html *(supervised)*
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py *(unsupervised)*

Thank you!
Basil

PS: By the way, I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked; I'll have to delve deeper into that. I agree with the response to Pavel that he should put it in a separate class, not add it onto the TFIDF one. I think it would take me about 6-8 weeks to adapt my code to the fit-transform model and submit a pull request.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sorianopavel at gmail.com Tue Jun 14 12:11:10 2016
From: sorianopavel at gmail.com (Pavel Soriano)
Date: Tue, 14 Jun 2016 16:11:10 +0000
Subject: [scikit-learn] Adding BM25 relevance function
In-Reply-To: References: Message-ID: 

Hey,

Good thing that you are trying to finish this.

Well, I looked into my old notes, and the Delta tf-idf comes from the "Delta TFIDF: An Improved Feature Space for Sentiment Analysis" paper.
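As far as I remember, the weighting it proposes is the term count scaled by the difference of the two class-wise idfs, roughly like this sketch (from memory, so please check it against the paper before trusting it; the variable names are mine):

import numpy as np

# C_td: count of term t in document d
# n_pos, n_neg: number of positive / negative training documents
# df_pos, df_neg: number of those documents that contain term t
def delta_tfidf(C_td, n_pos, df_pos, n_neg, df_neg):
    return C_td * (np.log2(n_pos / df_pos) - np.log2(n_neg / df_neg))

Terms whose document frequency differs strongly between the two classes get weights far from zero, while evenly distributed terms get weights close to zero.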
I guess it is not very popular, and apparently it has a drawback: it does not take into account the number of times a word occurs in each document while calculating the distribution amongst classes. At least that is what I wrote in my notes...

As for the delta idf... If it helps, I can look into my old code, because I do not know what I was talking about. I guess it has to do somehow with the paper cited before.

Cheers,

Pavel Soriano

On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti wrote:

> Hi Joel,
>
> Thanks for your response and for digging up that archived thread; it gives me a lot of clarity.
>
> I see your point about BM25, but I think that in most cases where TFIDF makes sense, BM25 makes sense as well, though it could be "overkill". Consider that TFIDF does not produce normalized results either: if BM25 requires dimensionality reduction (e.g. using LSA), so too would TFIDF. The term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that the term frequency in a document, and the document frequency in the corpus, can vary considerably across a broad range of values. You could even say TFIDF and BM25 are the same equation, except that BM25 has a few additional hyperparameters (b and k).
>
> So is the advantage that BM25 provides for large, diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (in a supervised setting, preferably) and I can plug in BM25 in place of TFIDF and see how it compares. Here are some I found:
>
> http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html *(supervised)*
> http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py *(unsupervised)*
>
> Thank you!
> Basil
>
> PS: By the way, I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked; I'll have to delve deeper into that. I agree with the response to Pavel that he should put it in a separate class, not add it onto the TFIDF one. I think it would take me about 6-8 weeks to adapt my code to the fit-transform model and submit a pull request.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-- 

Pavel SORIANO

PhD Student
ERIC Laboratory
Université de Lyon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com Tue Jun 14 12:13:29 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 14 Jun 2016 12:13:29 -0400
Subject: [scikit-learn] The culture of commit squashing
In-Reply-To: References: Message-ID: <57602D29.1070203@gmail.com>

I'm +1 for using the button when appropriate.
I think it should be up to the merging person to make a call whether a squash is a better logical unit than all the commits.
I would set a soft limit at ~5 commits or something. If your PR has more than 5 separate big logical units, it's probably too big.

The button is enabled in the settings but I can't see it.
Am I being stupid?

On 06/14/2016 06:58 AM, Joel Nothman wrote:
> Sounds good to me. Thank goodness someone reads the documentation!
>
> On 14 June 2016 at 19:51, Alexandre Gramfort wrote:
>
> > We could stop squashing during development, and use the new Squash-and-Merge button on GitHub.
> > What do you think?
> > +1 > > the reason I see for squashing during dev is to avoid killing the > browser when reviewing. It really rarely happens though. > > A > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.duprelatour at orange.fr Tue Jun 14 12:40:39 2016 From: tom.duprelatour at orange.fr (Tom DLT) Date: Tue, 14 Jun 2016 18:40:39 +0200 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: <57602D29.1070203@gmail.com> References: <57602D29.1070203@gmail.com> Message-ID: @Andreas It's a bit hidden: You need to click on "Merge pull-request", then do *not* click on "Confirm merge", but on the small arrow to the right, and select "Squash and merge". 2016-06-14 18:13 GMT+02:00 Andreas Mueller : > I'm +1 for using the button when appropriate. > I think it should be up to the merging person to make a call whether a > squash is a better > logical unit than all the commits. > I would set like a soft limit at ~5 commits or something. If your PR has > more than 5 separate > big logical units, it's probably too big. > > The button is enabled in the settings but I can't see it. > Am I being stupid? > > > On 06/14/2016 06:58 AM, Joel Nothman wrote: > > Sounds good to me. Thank goodness someone reads the documentation! > > On 14 June 2016 at 19:51, Alexandre Gramfort < > alexandre.gramfort at telecom-paristech.fr> wrote: > >> > We could stop squashing during development, and use the new >> Squash-and-Merge >> > button on GitHub. >> > What do you think? >> >> +1 >> >> the reason I see for squashing during dev is to avoid killing the >> browser when reviewing. It really rarely happens though. >> >> A >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From basilbeirouti at gmail.com Wed Jun 15 01:01:48 2016 From: basilbeirouti at gmail.com (Basil Beirouti) Date: Wed, 15 Jun 2016 00:01:48 -0500 Subject: [scikit-learn] adding BM25 relevance function Message-ID: Hello Pavel and Joel, I forked the repository and cloned it on my machine. I'm using pycharm on a Mac, and while looking at text.py, I'm getting an unresolved reference for "xrange" at line 28: from ..externals.six.moves import range Pycharm says Function 'six.py' is too large to analyze, so I'm not sure if this error is somehow related to that. I decided to try to build the code as a sanity check but I can't find any reliable instructions as to how to do that. Naively, I opened terminal and cd to the directory above "scikit-learn" folder (where I had cloned my fork) and tried to run: $ python3 setup.py install Which didn't work. I got this error: ImportError: No module named 'sklearn' Can someone point me in the right direction? 
And how can the code try to import sklearn if it doesn't exist yet? Note I
haven't installed the release version of scikit-learn using pip or any other
tool, but I should be able to bootstrap it from the source code, right?

Here's the full error message if it helps. Forgive me if it's a silly
mistake, but I haven't found any reliable guidelines online.

  File "setup.py", line 84, in <module>
    from numpy.distutils.core import setup
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/core.py", line 26, in <module>
    from numpy.distutils.command import config, config_compiler, \
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/command/build_ext.py", line 18, in <module>
    from numpy.distutils.system_info import combine_paths
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/system_info.py", line 232, in <module>
    triplet = str(p.communicate()[0].decode().strip())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 791, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Basils-MacBook-Pro:sklearn basilbeirouti$ python3 setup.py install
non-existing path in '__check_build': '_check_build.c'
Appending sklearn.__check_build configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')
Appending sklearn._build_utils configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn._build_utils')
Appending sklearn.covariance configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance')
Appending sklearn.covariance/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance/tests')
Appending sklearn.cross_decomposition configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition')
Appending sklearn.cross_decomposition/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition/tests')
Appending sklearn.feature_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection')
Appending sklearn.feature_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection/tests')
Appending sklearn.gaussian_process configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process')
Appending sklearn.gaussian_process/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process/tests')
Appending sklearn.mixture configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture')
Appending sklearn.mixture/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture/tests')
Appending sklearn.model_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection')
Appending sklearn.model_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection/tests')
Appending sklearn.neural_network configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network')
Appending sklearn.neural_network/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network/tests')
Appending sklearn.preprocessing configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing')
Appending sklearn.preprocessing/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing/tests')
Appending sklearn.semi_supervised configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised')
Appending sklearn.semi_supervised/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised/tests')
Warning: Assuming default configuration (./_build_utils/{setup__build_utils,setup}.py was not found)
Warning: Assuming default configuration (./covariance/{setup_covariance,setup}.py was not found)
Warning: Assuming default configuration (./covariance/tests/setup_covariance/{setup_covariance/tests,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/{setup_cross_decomposition,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/tests/setup_cross_decomposition/{setup_cross_decomposition/tests,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/{setup_feature_selection,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/tests/setup_feature_selection/{setup_feature_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/{setup_gaussian_process,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/tests/setup_gaussian_process/{setup_gaussian_process/tests,setup}.py was not found)
Warning: Assuming default configuration (./mixture/{setup_mixture,setup}.py was not found)
Warning: Assuming default configuration (./mixture/tests/setup_mixture/{setup_mixture/tests,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/{setup_model_selection,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/tests/setup_model_selection/{setup_model_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/{setup_neural_network,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/tests/setup_neural_network/{setup_neural_network/tests,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/{setup_preprocessing,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/tests/setup_preprocessing/{setup_preprocessing/tests,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/{setup_semi_supervised,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/tests/setup_semi_supervised/{setup_semi_supervised/tests,setup}.py was not found)
Traceback (most recent call last):
  File "setup.py", line 85, in <module>
    setup(**configuration(top_path='').todict())
  File "setup.py", line 44, in configuration
    config.add_subpackage('cluster')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 1003, in add_subpackage
    caller_level = 2)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 972, in get_subpackage
    caller_level = caller_level + 1)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 884, in _get_configuration_from_setup_py
    ('.py', 'U', 1))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 172, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 693, in _load
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 662, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./cluster/setup.py", line 8, in <module>
    from sklearn._build_utils import get_blas_info
ImportError: No module named 'sklearn'
From t3kcit at gmail.com Wed Jun 15 13:53:31 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 15 Jun 2016 13:53:31 -0400
Subject: [scikit-learn] adding BM25 relevance function
Message-ID: <5761961B.7000404@gmail.com>

I don't see an unresolved reference to xrange, but I do see that it can't
import sklearn. Did you build scikit-learn?
See:
http://scikit-learn.org/dev/developers/contributing.html#retrieving-the-latest-code

Either do

    make

or

    python setup.py build_ext -i

or

    python setup.py develop

or

    pip install -e .
(which all do slightly different things).

I'd probably go with the first if you have another installation of
scikit-learn on your machine, and the last if you want to make that your
primary installation.

Cheers,
Andy

On 06/15/2016 01:01 AM, Basil Beirouti wrote:
> Hello Pavel and Joel,
>
> I forked the repository and cloned it on my machine. I'm using PyCharm on
> a Mac, and while looking at text.py, I'm getting an unresolved reference
> for "xrange" at line 28:
>
> from ..externals.six.moves import range
>
> [...] I tried to run:
>
> $ python3 setup.py install
>
> which didn't work. I got this error:
>
> ImportError: No module named 'sklearn'
>
> Can someone point me in the right direction?
From joel.nothman at gmail.com Wed Jun 15 22:25:48 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 16 Jun 2016 12:25:48 +1000
Subject: [scikit-learn] adding BM25 relevance function

If xrange is the issue, then the branch you're getting may not have been
tested for Python 3.

On 16 June 2016 at 03:53, Andreas Mueller wrote:
> I don't see an unresolved reference to xrange, but I do see that it can't
> import sklearn. Did you build scikit-learn?
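
For anyone hitting the same IDE warning: the flagged import is the
Python 2/3 compatibility shim from six, which scikit-learn bundles under
sklearn.externals. A tiny standalone illustration (using the plain six
package rather than the bundled copy):

    # On Python 2, six.moves.range is the lazy xrange builtin; on
    # Python 3 it is simply the builtin range. Code importing it from
    # six.moves therefore runs unchanged on both interpreters.
    from six.moves import range

    total = sum(i * i for i in range(10 ** 6))  # no huge list on Python 2
    print(total)

Given that PyCharm reports six.py as too large to analyze, the "unresolved
reference" is most likely the IDE giving up on the file, not an actual
Python 3 problem in the branch.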
From matthieu.brucher at gmail.com Thu Jun 16 09:33:27 2016
From: matthieu.brucher at gmail.com (Matthieu Brucher)
Date: Thu, 16 Jun 2016 14:33:27 +0100
Subject: [scikit-learn] Question about error of LLE and backtransformation of coordinates

Hi!

The errors are quite small compared to the machine precision. As the
reduction is also an approximation of the underlying manifold, and not an
"isotropic" one (you can see in the example that red points are less
squashed together than blue ones), you won't have a perfect reconstruction
either.

In a way, if you can reproduce in the reduced space the same distances (or
barycenters, for LLE) as in the original space, then you can have a perfect
reconstruction (but it will still be subject to floating-point precision).
For the sphere, you can't: take 4 points; can you make the fourth a
barycenter of the other 3? No. That's the error you are seeing.

Cheers,

Matthieu

2016-06-16 10:12 GMT+01:00 Unger, Jörg:
> I've tried the example that is available here
>
> http://scikit-learn.org/stable/auto_examples/manifold/plot_manifold_sphere.html
>
> These are essentially points on a 3D sphere, so the dimension of the
> embedded manifold is two.
>
> I've changed the example a little bit to extract the error as well. So
> instead of
>
> trans_data = manifold\
>     .LocallyLinearEmbedding(n_neighbors, 2,
>                             method=method).fit_transform(sphere_data).T
>
> I've done something like
>
> solver = manifold.LocallyLinearEmbedding(n_neighbors, dim_y, method=method)
> trans_data = solver.fit_transform(sphere_data).T
> error = solver.reconstruction_error_
>
> I would have expected the error to be significant for dim_y=1, since I
> can't reproduce the results with just a single coordinate. For dim_y=2, I
> expected a significant decrease, and for dim_y=3, I expected to exactly
> recover the original result.
>
> What I get is (for standard LLE)
>
> dim_y = 1 : error = 1.62031573333e-07
> dim_y = 2 : error = 1.79465538543e-06
> dim_y = 3 : error = 7.00280676182e-06
>
> Could anyone explain why I do not get the expected results?
>
> Furthermore, is there an option to retransform the coordinates from the
> local dimension to the global dimension? I'm interested in transforming
> the original global samples to local coordinates (this is done via the
> transform method), but then I would like to transform samples from
> coordinates in the embedded space back into the global space.
>
> Best regards,
> Jörg F. Unger

--
Information System Engineer, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher
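
To make concrete what reconstruction_error_ does and does not measure,
here is a short sketch along the lines of Jörg's experiment; the attribute
is actual scikit-learn API, but the data is an S-curve rather than the
sphere from the example, purely for brevity:

    from sklearn import datasets, manifold

    # points lying (roughly) on a 2-D manifold embedded in 3-D
    X, _ = datasets.make_s_curve(1000, random_state=0)

    for dim_y in (1, 2, 3):
        solver = manifold.LocallyLinearEmbedding(n_neighbors=10,
                                                 n_components=dim_y,
                                                 method='standard')
        solver.fit_transform(X)
        # the LLE cost at its optimum: how well each embedded point is
        # reproduced by the barycenter of its embedded neighbors, not
        # how much geometry was lost going down to dim_y dimensions
        print(dim_y, solver.reconstruction_error_)

Consistent with the explanation above, the printed numbers stay tiny for
every dim_y, as in Jörg's output, because the barycentric weights are
evaluated in the embedded space rather than against the original 3-D
coordinates.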
--
Information System Engineer, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tully at csc.kth.se  Fri Jun 17 11:01:10 2016
From: tully at csc.kth.se (Philip Tully)
Date: Fri, 17 Jun 2016 11:01:10 -0400
Subject: [scikit-learn] Estimator.predict() thread safety
Message-ID:

Hi all,

I notice when I train a model and expose the predict function through a web
API, predict takes longer to run in a multi-threaded environment than in a
single-threaded one. I'm guessing the root cause has something to do with
thread collisions, but I must be doing something incorrectly within the code
(I set n_jobs=-1 for both FeatureUnion and estimators/gridsearchers there).

Has someone else run into a similar issue? I can provide more details if this
Q is rather opaque still

best,
Philip
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Fri Jun 17 11:13:16 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 17 Jun 2016 11:13:16 -0400
Subject: Re: [scikit-learn] Estimator.predict() thread safety
In-Reply-To:
References:
Message-ID: <171126A6-BBA1-414A-BDCD-C10BE03DA592@sebastianraschka.com>

That's interesting. I believe n_jobs just wraps multiprocessing through
joblib. So, setting it to -1 can essentially spawn as many processes as you
have processors available, which may come with an undesired overhead in your
environment. I am not sure if this is (still) an issue, but maybe have a look
at this one here:

https://twiki.cern.ch/twiki/bin/view/Main/PythonLoggingThreadingMultiprocessingIntermixedStudy

> Mixing Python modules multiprocessing and threading along with logging
> error/debug messages with module logging is a very bad idea which leads to
> unexpected process stalling. Indeed, having concurrent entities based on
> both multiprocessing.Process and threading.Thread within a single
> application causes right the multiprocessing.Process to stall and never
> the threading.Thread, as the attached examples demonstrate. The
> problematic behaviour demonstrates:

Best,
Sebastian

> On Jun 17, 2016, at 11:01 AM, Philip Tully wrote:
>
> Hi all,
>
> I notice when I train a model and expose the predict function through a
> web API, predict takes longer to run in a multi-threaded environment than
> in a single-threaded one. I'm guessing the root cause has something to do
> with thread collisions, but I must be doing something incorrectly within
> the code (I set n_jobs=-1 for both FeatureUnion and
> estimators/gridsearchers there)
>
> has someone else run into a similar issue?
I can provide more details if this Q is rather opaque still
>
> best,
> Philip
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From mail at sebastianraschka.com  Fri Jun 17 11:18:33 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 17 Jun 2016 11:18:33 -0400
Subject: Re: [scikit-learn] Estimator.predict() thread safety
In-Reply-To:
References:
Message-ID:

> I set n_jobs=-1 for both FeatureUnion and estimators/gridsearchers there

I am typically careful with this, e.g., if my machine has 16 cores, I'd set
the feature union to n_jobs=3 and the gridsearch_cv to n_jobs=4 or so.
Curious to hear what the scikit devs think about nesting calls n_jobs=-1; am
I too conservative?

Best,
Sebastian

> On Jun 17, 2016, at 11:01 AM, Philip Tully wrote:
>
> Hi all,
>
> I notice when I train a model and expose the predict function through a
> web API, predict takes longer to run in a multi-threaded environment than
> in a single-threaded one. I'm guessing the root cause has something to do
> with thread collisions, but I must be doing something incorrectly within
> the code (I set n_jobs=-1 for both FeatureUnion and
> estimators/gridsearchers there)
>
> has someone else run into a similar issue? I can provide more details if
> this Q is rather opaque still
>
> best,
> Philip
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From gael.varoquaux at normalesup.org  Fri Jun 17 11:21:41 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 17 Jun 2016 17:21:41 +0200
Subject: Re: [scikit-learn] Estimator.predict() thread safety
In-Reply-To:
References:
Message-ID: <20160617152141.GD2458470@phare.normalesup.org>

> I am typically careful with this, e.g., if my machine has 16 cores, I'd
> set the feature union to n_jobs=3 and the gridsearch_cv to n_jobs=4 or so.
> Curious to hear what the scikit devs think about nesting calls n_jobs=-1;
> am I too conservative?

Nested parallelism doesn't work. It's a limitation of multiprocessing.

From tully at csc.kth.se  Fri Jun 17 11:46:59 2016
From: tully at csc.kth.se (Philip Tully)
Date: Fri, 17 Jun 2016 11:46:59 -0400
Subject: Re: [scikit-learn] Estimator.predict() thread safety
In-Reply-To: <20160617152141.GD2458470@phare.normalesup.org>
References: <20160617152141.GD2458470@phare.normalesup.org>
Message-ID:

Gotcha - so perhaps I should ensure FeatureUnion[n_jobs] +
GridSearch[n_jobs] < # cores?

On Fri, Jun 17, 2016 at 11:21 AM, Gael Varoquaux wrote:
> > I am typically careful with this, e.g., if my machine has 16 cores, I'd
> > set the feature union to n_jobs=3 and the gridsearch_cv to n_jobs=4 or
> > so. Curious to hear what the scikit devs think about nesting calls
> > n_jobs=-1; am I too conservative?
>
> Nested parallelism doesn't work. It's a limitation of multiprocessing.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gael.varoquaux at normalesup.org  Fri Jun 17 11:51:55 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 17 Jun 2016 17:51:55 +0200
Subject: Re: [scikit-learn] Estimator.predict() thread safety
In-Reply-To:
References: <20160617152141.GD2458470@phare.normalesup.org>
Message-ID: <20160617155155.GS654315@phare.normalesup.org>

No, the inner loop won't be run in parallel.

G

On Fri, Jun 17, 2016 at 11:46:59AM -0400, Philip Tully wrote:
> Gotcha - so perhaps I should ensure FeatureUnion[n_jobs] +
> GridSearch[n_jobs] < # cores?

> On Fri, Jun 17, 2016 at 11:21 AM, Gael Varoquaux wrote:
> > I am typically careful with this, e.g., if my machine has 16 cores, I'd
> > set the feature union to n_jobs=3 and the gridsearch_cv to n_jobs=4 or
> > so. Curious to hear what the scikit devs think about nesting calls
> > n_jobs=-1; am I too conservative?

> Nested parallelism doesn't work. It's a limitation of multiprocessing.

> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

From mail at sebastianraschka.com  Fri Jun 17 11:52:25 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 17 Jun 2016 11:52:25 -0400
Subject: Re: [scikit-learn] Estimator.predict() thread safety
In-Reply-To:
References: <20160617152141.GD2458470@phare.normalesup.org>
Message-ID: <08297C98-DA8F-4253-84A0-102C0233F771@sebastianraschka.com>

I think

> FeatureUnion[n_jobs=1] + GridSearch[n_jobs <= cores]

would be better regarding the nested parallelism limitation.

> On Jun 17, 2016, at 11:46 AM, Philip Tully wrote:
>
> Gotcha - so perhaps I should ensure FeatureUnion[n_jobs] +
> GridSearch[n_jobs] < # cores?
>
> On Fri, Jun 17, 2016 at 11:21 AM, Gael Varoquaux wrote:
> > I am typically careful with this, e.g., if my machine has 16 cores, I'd
> > set the feature union to n_jobs=3 and the gridsearch_cv to n_jobs=4 or
> > so. Curious to hear what the scikit devs think about nesting calls
> > n_jobs=-1; am I too conservative?
>
> Nested parallelism doesn't work. It's a limitation of multiprocessing.
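As a concrete illustration of that configuration, a minimal sketch (the
transformers and the grid are arbitrary placeholders, and the grid search
import path assumes the 0.17-era layout; it moved to sklearn.model_selection
in 0.18):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# keep the inner FeatureUnion serial; parallelize only the outer search
features = FeatureUnion([('pca', PCA(n_components=5)),
                         ('kbest', SelectKBest(k=5))], n_jobs=1)
pipe = Pipeline([('features', features), ('svc', SVC())])
search = GridSearchCV(pipe, {'svc__C': [0.1, 1, 10]}, n_jobs=4)
search.fit(X, y)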
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tully at csc.kth.se Fri Jun 17 12:14:57 2016 From: tully at csc.kth.se (Philip Tully) Date: Fri, 17 Jun 2016 12:14:57 -0400 Subject: [scikit-learn] Estimator.predict() thread safety In-Reply-To: <08297C98-DA8F-4253-84A0-102C0233F771@sebastianraschka.com> References: <20160617152141.GD2458470@phare.normalesup.org> <08297C98-DA8F-4253-84A0-102C0233F771@sebastianraschka.com> Message-ID: Can confirm FeatureUnion[n_jobs=1] solved my issue, fwiw Thanks for prompt replies On Fri, Jun 17, 2016 at 11:52 AM, Sebastian Raschka < mail at sebastianraschka.com> wrote: > I think > > > FeatureUnion[n_jobs=1] + GirdSearch[n_jobs <= cores] > > would be better regarding the nested parallelism limitation > > > On Jun 17, 2016, at 11:46 AM, Philip Tully wrote: > > > > Gotcha - so perhaps I should ensure FeatureUnion[n_jobs] + > GirdSearch[n_jobs] < # cores? > > > > On Fri, Jun 17, 2016 at 11:21 AM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > > > I am typically careful with this, e.g., if my machine has 16 cores, I?d > > > set feature union to n_jobs=3 and the gridsearch_cv to n_jobs=4 or so. > > > Curious to hear what the scikit devs think about nesting calls > > > n_jobs=-1; am I too conservative? > > > > Nested parallelism doesn't work. It's a limitation of multiprocessing. > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sagarkar10 at gmail.com Sat Jun 18 05:08:57 2016 From: sagarkar10 at gmail.com (Sagar Kar) Date: Sat, 18 Jun 2016 14:38:57 +0530 Subject: [scikit-learn] New Contributor Message-ID: Hi Developers, I am new to scikit-learn and want some guidance to start contributing to it. I am well versed with python and c/c++. Linux user. Can anyone help me get me going. Regards, *Sagar Kar* P: +91-9085986905 W: sagarkar10.github.io -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Jun 18 05:18:08 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 18 Jun 2016 11:18:08 +0200 Subject: [scikit-learn] New Contributor In-Reply-To: References: Message-ID: <20160618091808.GH2458470@phare.normalesup.org> The best thing to do is to read the contributors documentation, and then get started with an issue labelled easy. Cheers, Ga?l From sagarkar10 at gmail.com Sat Jun 18 05:33:17 2016 From: sagarkar10 at gmail.com (Sagar Kar) Date: Sat, 18 Jun 2016 15:03:17 +0530 Subject: [scikit-learn] New Contributor In-Reply-To: <20160618091808.GH2458470@phare.normalesup.org> References: <20160618091808.GH2458470@phare.normalesup.org> Message-ID: Thanks Gael, I read it. But I am having hard time finding an issue to work on. 
Frankly, I am unable to understand how to approach the easy issues also. Sorry for being naive. Regards, *Sagar Kar* P: +91-9085986905 W: sagarkar10.github.io On Sat, Jun 18, 2016 at 2:48 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > The best thing to do is to read the contributors documentation, and then > get started with an issue labelled easy. > > Cheers, > > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthieu.brucher at gmail.com Sat Jun 18 06:20:15 2016 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Sat, 18 Jun 2016 11:20:15 +0100 Subject: [scikit-learn] New Contributor In-Reply-To: References: <20160618091808.GH2458470@phare.normalesup.org> Message-ID: You can also try one, and if you are stuck, just ask for help. Someone should be able to help you out ;) 2016-06-18 10:33 GMT+01:00 Sagar Kar : > Thanks Gael, > I read it. But I am having hard time finding an issue to work on. Frankly, > I am unable to understand how to approach the easy issues also. > Sorry for being naive. > > > > Regards, > *Sagar Kar* > P: +91-9085986905 > W: sagarkar10.github.io > > > On Sat, Jun 18, 2016 at 2:48 PM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> The best thing to do is to read the contributors documentation, and then >> get started with an issue labelled easy. >> >> Cheers, >> >> Ga?l >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Information System Engineer, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Sat Jun 18 19:18:32 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sun, 19 Jun 2016 01:18:32 +0200 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: References: <57602D29.1070203@gmail.com> Message-ID: IMHO the squash and merge should not be used when there are commits from 2 or more different authors to avoid crediting only a single author. On Tue, Jun 14, 2016 at 6:40 PM, Tom DLT wrote: > @Andreas > It's a bit hidden: You need to click on "Merge pull-request", then do > *not* click on "Confirm merge", but on the small arrow to the right, and > select "Squash and merge". > > 2016-06-14 18:13 GMT+02:00 Andreas Mueller : > >> I'm +1 for using the button when appropriate. >> I think it should be up to the merging person to make a call whether a >> squash is a better >> logical unit than all the commits. >> I would set like a soft limit at ~5 commits or something. If your PR has >> more than 5 separate >> big logical units, it's probably too big. >> >> The button is enabled in the settings but I can't see it. >> Am I being stupid? >> >> >> On 06/14/2016 06:58 AM, Joel Nothman wrote: >> >> Sounds good to me. Thank goodness someone reads the documentation! >> >> On 14 June 2016 at 19:51, Alexandre Gramfort < >> alexandre.gramfort at telecom-paristech.fr> wrote: >> >>> > We could stop squashing during development, and use the new >>> Squash-and-Merge >>> > button on GitHub. 
>>> > What do you think? >>> >>> +1 >>> >>> the reason I see for squashing during dev is to avoid killing the >>> browser when reviewing. It really rarely happens though. >>> >>> A >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jun 18 19:53:00 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 19 Jun 2016 09:53:00 +1000 Subject: [scikit-learn] The culture of commit squashing In-Reply-To: References: <57602D29.1070203@gmail.com> Message-ID: Yes, I tried to work that out regarding @afouchet and @tguillemot's work, but I may have failed to. However, both are credited in what's new, etc... And commit counts really say little. On 19 June 2016 at 09:18, Raghav R V wrote: > IMHO the squash and merge should not be used when there are commits from 2 > or more different authors to avoid crediting only a single author. > > On Tue, Jun 14, 2016 at 6:40 PM, Tom DLT > wrote: > >> @Andreas >> It's a bit hidden: You need to click on "Merge pull-request", then do >> *not* click on "Confirm merge", but on the small arrow to the right, and >> select "Squash and merge". >> >> 2016-06-14 18:13 GMT+02:00 Andreas Mueller : >> >>> I'm +1 for using the button when appropriate. >>> I think it should be up to the merging person to make a call whether a >>> squash is a better >>> logical unit than all the commits. >>> I would set like a soft limit at ~5 commits or something. If your PR has >>> more than 5 separate >>> big logical units, it's probably too big. >>> >>> The button is enabled in the settings but I can't see it. >>> Am I being stupid? >>> >>> >>> On 06/14/2016 06:58 AM, Joel Nothman wrote: >>> >>> Sounds good to me. Thank goodness someone reads the documentation! >>> >>> On 14 June 2016 at 19:51, Alexandre Gramfort < >>> alexandre.gramfort at telecom-paristech.fr> wrote: >>> >>>> > We could stop squashing during development, and use the new >>>> Squash-and-Merge >>>> > button on GitHub. >>>> > What do you think? >>>> >>>> +1 >>>> >>>> the reason I see for squashing during dev is to avoid killing the >>>> browser when reviewing. It really rarely happens though. 
>>>> A
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ross at cgl.ucsf.edu  Sun Jun 19 06:25:24 2016
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Sun, 19 Jun 2016 03:25:24 -0700
Subject: [scikit-learn] interpretation of manifold.MDS results
Message-ID: <6a62effd-7053-ef98-33e6-1b67ad6f93bc@cgl.ucsf.edu>

I have a set of photos and a set of metrics for assigning distances between
them. I want to find the dimensionality of (i.e., the space needed to embed)
each set of distances, out of curiosity and to inform my use of the metrics
in forming the character of an AI.

It was suggested that manifold.MDS could help, and it looked perfect for a
toy case I wrote. However, with actual data I found that the stress_ term
doesn't go to 0 when the number of components reaches the number of points:

http://stackoverflow.com/questions/37855596/calculate-the-spatial-dimension-of-a-graph

I'm not sure this question has achieved visibility, since I added the
scikit-learn tag after it was created. I'd appreciate any discussion of the
meaning of the limiting value of stress_, and whether a meaningful
dimensionality measure can be derived (see the sketch after the next
message).

Thanks,
Bill

From olologin at gmail.com  Tue Jun 21 00:09:30 2016
From: olologin at gmail.com (olologin)
Date: Tue, 21 Jun 2016 07:09:30 +0300
Subject: [scikit-learn] Code review
In-Reply-To:
References: <20160618091808.GH2458470@phare.normalesup.org>
Message-ID: <5768BDFA.3000807@gmail.com>

Hi guys, I know scikit-learn may not be your main project, and you are all
very busy at work, so you don't have free time to review all pull requests;
I understand that.

Is there something project leaders can do to speed up the review process? I
have a bunch of pull requests that I made 5-7 months ago, and they are
relatively useful :), but they weren't reviewed. Some of them are quite
simple to review (~10 lines of Python) and still have only 1 vote. Maybe
it's possible to increase the size of the team that has permission to
review? Or is all I can do just wait? :)

I'm writing this letter because yesterday someone complained about this in
one of my PRs (I had forgotten about that PR by then):
https://github.com/scikit-learn/scikit-learn/pull/6116#issuecomment-227044224.
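Regarding the manifold.MDS question above, a minimal sketch of sweeping
n_components against stress_ (the random points below are only a toy
stand-in for the photo-distance matrix, so the stress values themselves are
illustrative):

import numpy as np
from sklearn import manifold
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(0)
points = rng.rand(10, 4)
D = euclidean_distances(points)  # precomputed pairwise distances

for n in range(1, 6):
    mds = manifold.MDS(n_components=n, dissimilarity='precomputed',
                       random_state=0)
    mds.fit(D)
    print("n_components=%d  stress_=%g" % (n, mds.stress_))

Note that the SMACOF solver behind manifold.MDS only converges to a local
minimum, so stress_ is not guaranteed to reach 0 even when n_components is
large enough for an exact embedding.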
From nfliu at uw.edu Tue Jun 21 00:17:52 2016 From: nfliu at uw.edu (Nelson Liu) Date: Tue, 21 Jun 2016 04:17:52 +0000 Subject: [scikit-learn] Code review In-Reply-To: <5768BDFA.3000807@gmail.com> References: <20160618091808.GH2458470@phare.normalesup.org> <5768BDFA.3000807@gmail.com> Message-ID: The review process has always been quite slow; the only thing you can do is ping and try to fix things on your side as quickly as possible. There are a lot of PRs in development at any one time, and it's difficult for the reviewers (let alone the contributors, as you mentioned) to keep track of everything that's being done. If you find that no one has looked at your code in awhile, I'd ping some of the core contributors. Also see: http://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention On Mon, Jun 20, 2016 at 9:10 PM olologin wrote: > Hi guys, I know scikit-learn may not be your main project, and you all > are very busy at work so you don't have free time to review all pull > requests, I understand it. > > Is there something project leaders can do to speed-up review process? > Because I have bunch of pull requests which I made 5-7 months ago, and > they are relatively useful :), but they wasn't reviewed. Some of them > are quite simple for review (~10 lines in python), and still have only 1 > vote. Maybe it's possible to increase size of team which have permission > to review? Or all I can do is just wait? :) > > I'm writing this letter because yesterday one guy complained about this > in one of my PR's (I forgot about that PR by the time) > > https://github.com/scikit-learn/scikit-learn/pull/6116#issuecomment-227044224 > . > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Jun 21 00:35:13 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 21 Jun 2016 14:35:13 +1000 Subject: [scikit-learn] Code review In-Reply-To: References: <20160618091808.GH2458470@phare.normalesup.org> <5768BDFA.3000807@gmail.com> Message-ID: I think perhaps that FAQ should be updated to say "nag if needed"! Apologies for that delay, @olologin. Yes, it would be good if we had a better way of organising reviewing priorities, but between github's feature set and the distributed nature of the core dev team, we land up relying on chance, or a developer being interested in a particular feature. Again, nagging helps. On 21 June 2016 at 14:17, Nelson Liu wrote: > The review process has always been quite slow; the only thing you can do > is ping and try to fix things on your side as quickly as possible. There > are a lot of PRs in development at any one time, and it's difficult for the > reviewers (let alone the contributors, as you mentioned) to keep track of > everything that's being done. If you find that no one has looked at your > code in awhile, I'd ping some of the core contributors. Also see: > http://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention > > On Mon, Jun 20, 2016 at 9:10 PM olologin wrote: > >> Hi guys, I know scikit-learn may not be your main project, and you all >> are very busy at work so you don't have free time to review all pull >> requests, I understand it. >> >> Is there something project leaders can do to speed-up review process? 
>> Because I have bunch of pull requests which I made 5-7 months ago, and
>> they are relatively useful :), but they wasn't reviewed. Some of them
>> are quite simple for review (~10 lines in python), and still have only 1
>> vote. Maybe it's possible to increase size of team which have permission
>> to review? Or all I can do is just wait? :)
>>
>> I'm writing this letter because yesterday one guy complained about this
>> in one of my PR's (I forgot about that PR by the time)
>> https://github.com/scikit-learn/scikit-learn/pull/6116#issuecomment-227044224
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gracehuangeh at gmail.com  Tue Jun 21 04:25:04 2016
From: gracehuangeh at gmail.com (Enhui HUANG)
Date: Tue, 21 Jun 2016 10:25:04 +0200
Subject: [scikit-learn] The sum of feature importances != 1
Message-ID:

Hi,

When I ran the following code:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=100)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
imp = clf.feature_importances_
print "The sum of feature importances:", sum(imp)

the sum of the feature importances was not always equal to 1. Do you have a
nice explanation for this situation?

Besides, if a tree contains only a root, could we say all of its feature
importances are 0? I guess such root-only trees influence the sum of the
feature importances. Is that right?

Best,
Enhui
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonathan.taylor at stanford.edu  Tue Jun 21 22:18:12 2016
From: jonathan.taylor at stanford.edu (Jonathan Taylor)
Date: Tue, 21 Jun 2016 19:18:12 -0700
Subject: [scikit-learn] isotonic regression weird behavior?
Message-ID:

Was trying to fit isotonic regression with non-trivial y_min and y_max:

In [17]: X
Out[17]:
array([ 1.26336413,  1.31853693, -0.57200917,  0.3072928 , -0.70686507,
       -0.17614937, -1.59943059,  1.05908504,  1.3958263 ,  1.90580318,
        0.20992272,  0.02836316, -0.08092235,  0.44438247,  0.01791253,
       -0.3771914 , -0.89577538, -0.37726249, -1.32687569,  0.18013201])

In [18]: iso.isotonic_regression(X, y_min=0, y_max=0.1)
Out[18]:
array([-0.00826919, -0.00826919, -0.00826919, -0.00826919, -0.00826919,
       -0.00826919, -0.00826919,  0.10449344,  0.10449344,  0.10449344,
        0.10449344,  0.10449344,  0.10449344,  0.10449344,  0.10449344,
        0.10449344,  0.10449344,  0.10449344,  0.10449344,  0.10449344])

The solution does not satisfy the bounds that each entry should be in
[0, 0.1].

--
Jonathan Taylor
Dept. of Statistics
Sequoia Hall, 137
390 Serra Mall
Stanford, CA 94305
Tel: 650.723.9230
Fax: 650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonathan.taylor at stanford.edu  Tue Jun 21 22:19:42 2016
From: jonathan.taylor at stanford.edu (Jonathan Taylor)
Date: Tue, 21 Jun 2016 19:19:42 -0700
Subject: Re: [scikit-learn] isotonic regression weird behavior?
In-Reply-To: References: Message-ID: Should have included: In [*22*]: iso Out[*22*]: On Tue, Jun 21, 2016 at 7:18 PM, Jonathan Taylor < jonathan.taylor at stanford.edu> wrote: > Was trying to fit isotonic regression with non-trivial y_min and y_max: > > In [*17*]: X > > Out[*17*]: > > array([ 1.26336413, 1.31853693, -0.57200917, 0.3072928 , -0.70686507, > > -0.17614937, -1.59943059, 1.05908504, 1.3958263 , 1.90580318, > > 0.20992272, 0.02836316, -0.08092235, 0.44438247, 0.01791253, > > -0.3771914 , -0.89577538, -0.37726249, -1.32687569, 0.18013201]) > > > In [*18*]: iso.isotonic_regression(X, y_min=0, y_max=0.1) > > Out[*18*]: > > array([-0.00826919, -0.00826919, -0.00826919, -0.00826919, -0.00826919, > > -0.00826919, -0.00826919, 0.10449344, 0.10449344, 0.10449344, > > 0.10449344, 0.10449344, 0.10449344, 0.10449344, 0.10449344, > > 0.10449344, 0.10449344, 0.10449344, 0.10449344, 0.10449344]) > > > The solution does not satisfy the bounds that each entry should be in > [0,0.1] > > > > -- > Jonathan Taylor > Dept. of Statistics > Sequoia Hall, 137 > 390 Serra Mall > Stanford, CA 94305 > Tel: 650.723.9230 > Fax: 650.725.8977 > Web: http://www-stat.stanford.edu/~jtaylo > -- Jonathan Taylor Dept. of Statistics Sequoia Hall, 137 390 Serra Mall Stanford, CA 94305 Tel: 650.723.9230 Fax: 650.725.8977 Web: http://www-stat.stanford.edu/~jtaylo -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.taylor at stanford.edu Tue Jun 21 22:21:59 2016 From: jonathan.taylor at stanford.edu (Jonathan Taylor) Date: Tue, 21 Jun 2016 19:21:59 -0700 Subject: [scikit-learn] isotonic regression weird behavior? In-Reply-To: References: Message-ID: Sorry, docstring is also a bit funny. Is the problem it is trying to solve have an __equality__ constraint for y_min, y_max or __inequality__ constraint for y_min / y_max? Either way the produced solution does not satisfy such a constraint... On Tue, Jun 21, 2016 at 7:19 PM, Jonathan Taylor < jonathan.taylor at stanford.edu> wrote: > Should have included: > > In [*22*]: iso > > Out[*22*]: '/Users/jonathantaylor/anaconda/envs/py27/lib/python2.7/site-packages/sklearn/isotonic.pyc'> > > On Tue, Jun 21, 2016 at 7:18 PM, Jonathan Taylor < > jonathan.taylor at stanford.edu> wrote: > >> Was trying to fit isotonic regression with non-trivial y_min and y_max: >> >> In [*17*]: X >> >> Out[*17*]: >> >> array([ 1.26336413, 1.31853693, -0.57200917, 0.3072928 , -0.70686507, >> >> -0.17614937, -1.59943059, 1.05908504, 1.3958263 , 1.90580318, >> >> 0.20992272, 0.02836316, -0.08092235, 0.44438247, 0.01791253, >> >> -0.3771914 , -0.89577538, -0.37726249, -1.32687569, 0.18013201]) >> >> >> In [*18*]: iso.isotonic_regression(X, y_min=0, y_max=0.1) >> >> Out[*18*]: >> >> array([-0.00826919, -0.00826919, -0.00826919, -0.00826919, -0.00826919, >> >> -0.00826919, -0.00826919, 0.10449344, 0.10449344, 0.10449344, >> >> 0.10449344, 0.10449344, 0.10449344, 0.10449344, 0.10449344, >> >> 0.10449344, 0.10449344, 0.10449344, 0.10449344, 0.10449344]) >> >> >> The solution does not satisfy the bounds that each entry should be in >> [0,0.1] >> >> >> >> -- >> Jonathan Taylor >> Dept. of Statistics >> Sequoia Hall, 137 >> 390 Serra Mall >> Stanford, CA 94305 >> Tel: 650.723.9230 >> Fax: 650.725.8977 >> Web: http://www-stat.stanford.edu/~jtaylo >> > > > > -- > Jonathan Taylor > Dept. of Statistics > Sequoia Hall, 137 > 390 Serra Mall > Stanford, CA 94305 > Tel: 650.723.9230 > Fax: 650.725.8977 > Web: http://www-stat.stanford.edu/~jtaylo > -- Jonathan Taylor Dept. 
of Statistics Sequoia Hall, 137 390 Serra Mall Stanford, CA 94305 Tel: 650.723.9230 Fax: 650.725.8977 Web: http://www-stat.stanford.edu/~jtaylo -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Jun 22 03:27:06 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 22 Jun 2016 09:27:06 +0200 Subject: [scikit-learn] isotonic regression weird behavior? In-Reply-To: References: Message-ID: <20160622072706.GF1018883@phare.normalesup.org> Looks like a bug indeed. Could you please put a small code snippet to enable us to reproduce. From nelle.varoquaux at gmail.com Wed Jun 22 11:55:06 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Wed, 22 Jun 2016 08:55:06 -0700 Subject: [scikit-learn] isotonic regression weird behavior? In-Reply-To: <20160622072706.GF1018883@phare.normalesup.org> References: <20160622072706.GF1018883@phare.normalesup.org> Message-ID: I've submitted a ticket: https://github.com/scikit-learn/scikit-learn/issues/6921 with the small example Jonathan wrote up in the email. Cheers, N On 22 June 2016 at 00:27, Gael Varoquaux wrote: > Looks like a bug indeed. Could you please put a small code snippet to > enable us to reproduce. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From johnmarktaylor at g.harvard.edu Wed Jun 22 13:15:43 2016 From: johnmarktaylor at g.harvard.edu (Taylor, Johnmark) Date: Wed, 22 Jun 2016 13:15:43 -0400 Subject: [scikit-learn] Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same Message-ID: Hello, I am moving much of my neuroimaging coding over to Python from Matlab and so I am switching from using libsvm in Matlab to using Scikit-learn SVM in Python. Just to make sure I am not changing anything substantive about my analyses, I am experimenting with the two implementations and trying to see whether I can get them to yield identical results. In Python I am using: clf = svm.SVC(kernel='linear',C=1,probability=True) In Matlab (libsvm) I am using: clf = libsvmtrain(svm_training_labels,svm_training_vectors,['-t 0 -b 1 -c 1']) When I run the SVM using these two different ways using simulated data, I get subtly different results, even though I have fixed all of the parameters of the SVMs to be the same using input arguments (linear classifier, C=1, use probability estimates), and even though all the other default parameters seem to be the same across these functions (tolerance = .001, both using shrinking heuristics by default). To give more details regarding the simulations: One simulation I ran was designed to be absurdly difficult--it yielded 40% accuracy for Matlab libsvm, and 44% accuracy for scikit-learn svm (binary classification, chance = 50%). In this simulation, the two SVMs agreed in their predictions only 18% of the time (in other words, they were both not only guessing below chance, but they nearly always gave opposite guesses compared to each other). The other simulation was easier, yielding 68% accuracy for Matlab libsvm, and 67% accuracy for scikit-learn SVM. In this simulation, the two SVMs agreed in their predictions 97% of the time. So even though they often got it wrong, they tended to make the same wrong guesses. Any idea of what could possibly be leading to differences in the results? 
My understanding is that SKL uses libsvm under the hood, so it's a been confusing why the decoders are behaving differently. Both analyses are being run on the same computer (Linux OS). Thank you very much, JohnMark Taylor PhD Student, Harvard Vision Sciences Lab -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael at bommaritollc.com Wed Jun 22 13:24:34 2016 From: michael at bommaritollc.com (Michael Bommarito) Date: Wed, 22 Jun 2016 13:24:34 -0400 Subject: [scikit-learn] Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same In-Reply-To: References: Message-ID: Did you fix the random seeds across implementations as well? Differences in seeds or generators might explain this. Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Wed, Jun 22, 2016 at 1:15 PM, Taylor, Johnmark < johnmarktaylor at g.harvard.edu> wrote: > Hello, > > I am moving much of my neuroimaging coding over to Python from Matlab and > so I am switching from using libsvm in Matlab to using Scikit-learn SVM in > Python. Just to make sure I am not changing anything substantive about my > analyses, I am experimenting with the two implementations and trying to see > whether I can get them to yield identical results. > > In Python I am using: > > clf = svm.SVC(kernel='linear',C=1,probability=True) > > In Matlab (libsvm) I am using: > > clf = libsvmtrain(svm_training_labels,svm_training_vectors,['-t 0 -b 1 -c 1']) > > When I run the SVM using these two different ways using simulated data, I > get subtly different results, even though I have fixed all of the > parameters of the SVMs to be the same using input arguments (linear > classifier, C=1, use probability estimates), and even though all the other > default parameters seem to be the same across these functions (tolerance = > .001, both using shrinking heuristics by default). > > To give more details regarding the simulations: > > One simulation I ran was designed to be absurdly difficult--it yielded 40% > accuracy for Matlab libsvm, and 44% accuracy for scikit-learn svm (binary > classification, chance = 50%). In this simulation, the two SVMs agreed in > their predictions only 18% of the time (in other words, they were both not > only guessing below chance, but they nearly always gave opposite guesses > compared to each other). > > The other simulation was easier, yielding 68% accuracy for Matlab libsvm, > and 67% accuracy for scikit-learn SVM. In this simulation, the two SVMs > agreed in their predictions 97% of the time. So even though they often got > it wrong, they tended to make the same wrong guesses. > > Any idea of what could possibly be leading to differences in the results? > My understanding is that SKL uses libsvm under the hood, so it's a been > confusing why the decoders are behaving differently. Both analyses are > being run on the same computer (Linux OS). > > Thank you very much, > > JohnMark Taylor > > PhD Student, Harvard Vision Sciences Lab > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From nicholdav at gmail.com Wed Jun 22 14:23:26 2016 From: nicholdav at gmail.com (David Nicholson) Date: Wed, 22 Jun 2016 14:23:26 -0400 Subject: [scikit-learn] Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same In-Reply-To: References: Message-ID: Did you try using the Python API to libsvm directly instead of through SKL? I'm guessing you have it on your computer since you have the Matlab API. That would at least let you test whether it's the fake data or whether it's SKL. Also are you loading the fake data from a .mat file into Python (e.g. with the SciPy 'loadmat' function) or are you generating it from a script? Maybe some weird floating point error between Python and Matlab is giving you the different results? This could happen if you generate the data with a script written in both Python and Matlab, for example... along the same lines as the random seed generator giving different results On Jun 22, 2016 1:27 PM, "Michael Bommarito" wrote: > Did you fix the random seeds across implementations as well? Differences > in seeds or generators might explain this. > > Thanks, > Michael J. Bommarito II, CEO > Bommarito Consulting, LLC > *Web:* http://www.bommaritollc.com > *Mobile:* +1 (646) 450-3387 > > On Wed, Jun 22, 2016 at 1:15 PM, Taylor, Johnmark < > johnmarktaylor at g.harvard.edu> wrote: > >> Hello, >> >> I am moving much of my neuroimaging coding over to Python from Matlab and >> so I am switching from using libsvm in Matlab to using Scikit-learn SVM in >> Python. Just to make sure I am not changing anything substantive about my >> analyses, I am experimenting with the two implementations and trying to see >> whether I can get them to yield identical results. >> >> In Python I am using: >> >> clf = svm.SVC(kernel='linear',C=1,probability=True) >> >> In Matlab (libsvm) I am using: >> >> clf = libsvmtrain(svm_training_labels,svm_training_vectors,['-t 0 -b 1 -c 1']) >> >> When I run the SVM using these two different ways using simulated data, I >> get subtly different results, even though I have fixed all of the >> parameters of the SVMs to be the same using input arguments (linear >> classifier, C=1, use probability estimates), and even though all the other >> default parameters seem to be the same across these functions (tolerance = >> .001, both using shrinking heuristics by default). >> >> To give more details regarding the simulations: >> >> One simulation I ran was designed to be absurdly difficult--it yielded >> 40% accuracy for Matlab libsvm, and 44% accuracy for scikit-learn svm >> (binary classification, chance = 50%). In this simulation, the two SVMs >> agreed in their predictions only 18% of the time (in other words, they were >> both not only guessing below chance, but they nearly always gave opposite >> guesses compared to each other). >> >> The other simulation was easier, yielding 68% accuracy for Matlab libsvm, >> and 67% accuracy for scikit-learn SVM. In this simulation, the two SVMs >> agreed in their predictions 97% of the time. So even though they often got >> it wrong, they tended to make the same wrong guesses. >> >> Any idea of what could possibly be leading to differences in the results? >> My understanding is that SKL uses libsvm under the hood, so it's a been >> confusing why the decoders are behaving differently. Both analyses are >> being run on the same computer (Linux OS). 
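One way to localize the discrepancy is to compare the fitted models rather
than the probabilistic predictions, since probability=True adds a Platt
scaling step that libsvm fits with an internal cross-validation whose data
shuffling consumes the random number generator. A minimal sketch (synthetic
data standing in for the simulations; the Matlab-side field name model.SVs
is assumed from the libsvm MATLAB interface):

import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)
clf = svm.SVC(kernel='linear', C=1, probability=True).fit(X, y)

# deterministic quantities to diff against the Matlab model (model.SVs etc.)
np.savetxt('skl_support_vectors.txt', clf.support_vectors_)
np.savetxt('skl_decision_values.txt', clf.decision_function(X))

If the support vectors and decision values agree but the probability
estimates do not, the RNG-dependent probability calibration is the likely
culprit.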
>> >> Thank you very much, >> >> JohnMark Taylor >> >> PhD Student, Harvard Vision Sciences Lab >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael at bommaritollc.com Wed Jun 22 14:37:53 2016 From: michael at bommaritollc.com (Michael Bommarito) Date: Wed, 22 Jun 2016 14:37:53 -0400 Subject: [scikit-learn] isotonic regression weird behavior? In-Reply-To: References: <20160622072706.GF1018883@phare.normalesup.org> Message-ID: A few quick thoughts: 1. What does the `isoreg` method in the `isotone` R library do with this data? We have seen multiple situations where differences between our implementation/behavior and the R implementation was not expected/communicated for users, so it would be good to know and potentially address here. 2. I'd like to draw our attention back to this PR's discussion: https://github.com/scikit-learn/scikit-learn/pull/4185 In particular, this comment distinguishing between a monotonic optimization of a specific sample* and a model fit from a training sample: https://github.com/scikit-learn/scikit-learn/pull/4185#issuecomment-72875303 For a long time, fit_transform() and fit() returned different results, and we have broken and unbroken this package for different use cases a few times over the last two years (e.g., `slinear` switch). Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Wed, Jun 22, 2016 at 11:55 AM, Nelle Varoquaux wrote: > I've submitted a ticket: > https://github.com/scikit-learn/scikit-learn/issues/6921 > with the small example Jonathan wrote up in the email. > > Cheers, > N > > On 22 June 2016 at 00:27, Gael Varoquaux > wrote: > >> Looks like a bug indeed. Could you please put a small code snippet to >> enable us to reproduce. >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jonathan.taylor at stanford.edu  Wed Jun 22 14:45:19 2016
From: jonathan.taylor at stanford.edu (Jonathan Taylor)
Date: Wed, 22 Jun 2016 11:45:19 -0700
Subject: Re: [scikit-learn] scikit-learn Digest, Vol 3, Issue 37
In-Reply-To:
References:
Message-ID:

import numpy as np
import nose.tools as nt

from sklearn.isotonic import isotonic_regression

def test_isotonic_ymin_ymax():
    X = np.array([1.26, 1.31, -0.57, 0.30, -0.70, -0.17, -1.59, 1.05,
                  1.39, 1.90, 0.20, 0.03, -0.08, 0.44, 0.01, -0.37,
                  -0.89, -0.37, -1.32, 0.18])
    X_iso = isotonic_regression(X, y_min=0., y_max=0.1)
    nt.assert_true(np.all((X_iso <= 0.1) * (X_iso >= 0.)))

test_isotonic_ymin_ymax()

On Wed, Jun 22, 2016 at 12:27 AM, wrote:
> From: Gael Varoquaux
> Subject: Re: [scikit-learn] isotonic regression weird behavior?
>
> Looks like a bug indeed. Could you please put a small code snippet to
> enable us to reproduce.

--
Jonathan Taylor
Dept. of Statistics
Sequoia Hall, 137
390 Serra Mall
Stanford, CA 94305
Tel: 650.723.9230
Fax: 650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: isobug.py Type: text/x-python-script Size: 468 bytes Desc: not available URL: From johnmarktaylor at g.harvard.edu Wed Jun 22 14:50:25 2016 From: johnmarktaylor at g.harvard.edu (Taylor, Johnmark) Date: Wed, 22 Jun 2016 14:50:25 -0400 Subject: [scikit-learn] Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same Message-ID: Many thanks for the responses thus far! *Did you fix the random seeds across implementations as well? Differencesin seeds or generators might explain this.* The implementation of libsvm used by Matlab always has a seed of 1. I tried setting the seed for SKL SVM to 1 (and 0, 2, 3, and 4) as well, and the results were still different. *Did you try using the Python API to libsvm directly instead of through SKL?I'm guessing you have it on your computer since you have the Matlab API.That would at least let you test whether it's the fake data or whether it'sSKL.* I'll give that a shot next, thanks! *Also are you loading the fake data from a .mat file into Python (e.g. withthe SciPy 'loadmat' function) or are you generating it from a script? Maybesome weird floating point error between Python and Matlab is giving you thedifferent results? This could happen if you generate the data with a scriptwritten in both Python and Matlab, for example... along the same lines asthe random seed generator giving different results* I'm generating the fake data with a Python script and saving it to a .txt file, which is then loaded in by Python and Matlab in their respective scripts. To make sure there's no truncation error going on when they load in this .txt file to get the fake data, I applied the floor function to both sets of vectors (to make them ints) in both the Python and Matlab scripts, and they still give different results. So I don't think it's a data issue. -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael at bommaritollc.com Wed Jun 22 15:07:47 2016 From: michael at bommaritollc.com (Michael Bommarito) Date: Wed, 22 Jun 2016 15:07:47 -0400 Subject: [scikit-learn] Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same In-Reply-To: References: Message-ID: Have you tried comparing the fit support vectors prior to comparing predicted values? You might need to set SaveSupportVectors in Matlab first. Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Wed, Jun 22, 2016 at 2:50 PM, Taylor, Johnmark < johnmarktaylor at g.harvard.edu> wrote: > Many thanks for the responses thus far! > > > *Did you fix the random seeds across implementations as well? > Differencesin seeds or generators might explain this.* > > The implementation of libsvm used by Matlab always has a seed of 1. I > tried setting the seed for SKL SVM to 1 (and 0, 2, 3, and 4) as well, and > the results were still different. > > > > > *Did you try using the Python API to libsvm directly instead of through > SKL?I'm guessing you have it on your computer since you have the Matlab > API.That would at least let you test whether it's the fake data or whether > it'sSKL.* > > I'll give that a shot next, thanks! > > > > > > > *Also are you loading the fake data from a .mat file into Python (e.g. > withthe SciPy 'loadmat' function) or are you generating it from a script? > Maybesome weird floating point error between Python and Matlab is giving > you thedifferent results? 
From michael at bommaritollc.com Wed Jun 22 15:16:30 2016
From: michael at bommaritollc.com (Michael Bommarito)
Date: Wed, 22 Jun 2016 15:16:30 -0400
Subject: [scikit-learn] Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same

Actually, I wonder if there is a difference between our implementation and Matlab's behavior. We seem to reset the seed to a hard-coded value when calling predict and predict_proba.

In predict() and predict_proba() in here, we call set_predict_params():
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/libsvm.pyx#L315
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/libsvm.pyx#L381

However, set_predict_params() appears to reset the RNG to a hard-coded value of -1:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/libsvm.pyx#L261

Because you are requesting probability estimates, the state of the RNG will affect the resulting scores. If Matlab doesn't similarly reset the RNG prior to each predict call, then a difference would manifest here. I think if the underlying support vectors match but our predictions do not, this might explain it.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387
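A sketch of the comparison Michael is suggesting: check that the fitted models agree before comparing probabilistic outputs, and prefer decision values over probability estimates, since only the latter pass through the RNG-dependent Platt scaling. X and y stand in for the thread's fake data; the file names and hyperparameter values are illustrative, not the original script's:

import numpy as np
from sklearn.svm import SVC

X = np.loadtxt('fake_data.txt')    # hypothetical file name
y = np.loadtxt('fake_labels.txt')  # hypothetical file name

clf = SVC(C=1.0, kernel='rbf', gamma=0.1, probability=True)
clf.fit(X, y)

# These should match Matlab's libsvm if the underlying fit agrees:
print(clf.support_)        # indices of the support vectors
print(clf.dual_coef_)      # dual coefficients
print(clf.intercept_)      # related to libsvm's rho, up to sign convention
print(clf.decision_function(X[:5]))  # deterministic, no RNG involved

# Probability estimates use Platt scaling with internal cross-validation,
# so they depend on the libsvm RNG state and may legitimately differ:
print(clf.predict_proba(X[:5]))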
From t3kcit at gmail.com Wed Jun 22 15:39:57 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 22 Jun 2016 15:39:57 -0400
Subject: Re: [scikit-learn] Code review
In-Reply-To: <5768BDFA.3000807@gmail.com>
References: <20160618091808.GH2458470@phare.normalesup.org> <5768BDFA.3000807@gmail.com>
Message-ID: <576AE98D.1050004@gmail.com>

Sorry, I've been off review duty for a while; should be back later this summer ;)

On 06/21/2016 12:09 AM, olologin wrote:
> Hi guys, I know scikit-learn may not be your main project, and you are all very busy at work, so you don't have free time to review all pull requests; I understand that.
>
> Is there something project leaders can do to speed up the review process? I have a bunch of pull requests which I made 5-7 months ago, and they are relatively useful :), but they haven't been reviewed. Some of them are quite simple to review (~10 lines of Python) and still have only 1 vote. Maybe it's possible to increase the size of the team that has permission to review? Or is all I can do just to wait? :)
>
> I'm writing this letter because yesterday someone complained about this in one of my PRs (I had forgotten about that PR by then):
> https://github.com/scikit-learn/scikit-learn/pull/6116#issuecomment-227044224

From gael.varoquaux at normalesup.org Thu Jun 23 01:57:35 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 23 Jun 2016 07:57:35 +0200
Subject: [scikit-learn] Welcome Loic Esteve (@lesteve) as a new core contributor
Message-ID: <20160623055735.GX2458470@phare.normalesup.org>

Hi,

I'd like to welcome Loic Esteve (@lesteve) as a new core contributor to the scikit-learn team.

Loic has been reviewing very seriously a number of PRs, beyond his own contributions. It's great to have him on board!

Cheers,

Gaël

From arnaud4567 at gmail.com Thu Jun 23 04:52:15 2016
From: arnaud4567 at gmail.com (Arnaud Joly)
Date: Thu, 23 Jun 2016 10:52:15 +0200
Subject: Re: [scikit-learn] Welcome Loic Esteve (@lesteve) as a new core contributor
In-Reply-To: <20160623055735.GX2458470@phare.normalesup.org>
References: <20160623055735.GX2458470@phare.normalesup.org>
Message-ID: <83AC1780-CD18-4DD5-8368-E4F8516F9ED9@gmail.com>

Congratulations, Loic!
Arnaud

From m.waseem.ahmad at gmail.com Thu Jun 23 05:20:04 2016
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Thu, 23 Jun 2016 10:20:04 +0100
Subject: [scikit-learn] Random forest fitting very well

Hi All,
I am trying to use random forests for a regression problem, with 10 input variables and one output variable. I am getting a very good fit even with default parameters and a low n_estimators. Even with n_estimators = 10, I get an R^2 value of 0.95 on the testing dataset (MSE = 23) and a value of 0.99 on the training set. I was wondering whether this is common with random forests or whether I am missing something. Could you please share your experience? The total number of samples (training + testing) is 10971.
Also, what are the most important parameters (max_depth, bootstrap, max_leaf_nodes, etc.) that I need to play with to tune my model even further? Lastly, is there a way I can visualise a single tree of my forest (just for demonstration purposes)?
Please see the figure below to demonstrate how well it is fitting with default values.

[Figure scrubbed: forest fitting.png, image/png, 86146 bytes]

Thanks
Kindest Regards
Waseem

From bdholt1 at gmail.com Thu Jun 23 06:05:43 2016
From: bdholt1 at gmail.com (Brian Holt)
Date: Thu, 23 Jun 2016 11:05:43 +0100
Subject: Re: [scikit-learn] Random forest fitting very well

Hi Muhammad,

If you've not yet read the documentation, I would highly recommend starting with the Decision Tree [1] and working your way through the examples on your own data. You'll find an example [2] of how to generate a graphviz-compatible dot file and visualise it.

Once you're satisfied that you understand what each tree is doing with your dataset as you vary parameters, then it makes sense to try to inject some randomness by varying the features used in each tree, or the samples, or both [3].

Regards,
Brian

[1] http://scikit-learn.org/stable/modules/tree.html
[2] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz
[3] http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
From m.waseem.ahmad at gmail.com Thu Jun 23 06:28:17 2016
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Thu, 23 Jun 2016 11:28:17 +0100
Subject: Re: [scikit-learn] Random forest fitting very well

Hi Brian,
Thanks for your email. I did try

tree.export_graphviz(model, out_file='tree.dot')

but I got an error saying

AttributeError: 'RandomForestRegressor' object has no attribute 'tree_'

which I think is because this is a forest, not a single tree; that's why I can't visualise it, no? Also, do you have any comments on the results that I got with the default values?

Regards
Waseem
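The error comes from passing the whole ensemble to export_graphviz, which expects a single decision tree; the individual trees of a fitted forest live in its estimators_ list. A small sketch (assuming model is the fitted RandomForestRegressor from the messages above):

from sklearn import tree

# Export just the first tree of the forest rather than the ensemble object.
tree.export_graphviz(model.estimators_[0], out_file='tree.dot')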
From cbrew at acm.org Thu Jun 23 07:00:43 2016
From: cbrew at acm.org (chris brew)
Date: Thu, 23 Jun 2016 12:00:43 +0100
Subject: Re: [scikit-learn] Random forest fitting very well

It is probably a good idea to start by separating off part of your training data into a held-out development set that is not used for training, which you can use to create learning curves and estimate probable performance on unseen data. I really recommend Andrew Ng's machine learning course material from Stanford and Coursera. It shows you how to use learning curves to understand your problem and also the way that different estimators behave.

There are many estimators that will achieve an extremely good fit to typical training data, but the differences between estimators show up mostly in what happens with unseen test data. Personally, I always start by seeing how well simple classifiers or regressors do (Naive Bayes, linear regression, etc.), then try regularized linear models like ElasticNets, then SVMs, then random forests or other ensemble models. That way, I end up using the powerful and complex models only when the data demands it.
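A minimal sketch of the held-out-set idea, using the sklearn.cross_validation API current at the time (X and y are placeholders for the 10971-sample dataset, and the split ratio is arbitrary):

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Keep a development set that never touches training.
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# A large gap between these two R^2 scores is the classic sign of overfitting.
print(model.score(X_train, y_train))
print(model.score(X_dev, y_dev))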
From joel.nothman at gmail.com Thu Jun 23 07:11:46 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 23 Jun 2016 21:11:46 +1000
Subject: Re: [scikit-learn] Welcome Loic Esteve (@lesteve) as a new core contributor

Thanks for some great work so far, Loic; I'm looking forward to more of your well-considered comments and contributions!

From m.waseem.ahmad at gmail.com Thu Jun 23 08:08:22 2016
From: m.waseem.ahmad at gmail.com (muhammad waseem)
Date: Thu, 23 Jun 2016 13:08:22 +0100
Subject: Re: [scikit-learn] Random forest fitting very well

Thanks, Chris. I will look into your recommendations. I have tried an artificial neural network and it was giving me good results on the test set as well.

Regards
Waseem
From ragvrv at gmail.com Thu Jun 23 08:47:35 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Thu, 23 Jun 2016 14:47:35 +0200
Subject: Re: [scikit-learn] Code review

> "nag if needed"!

I always assume it to be implicit advice ;P

From joel.nothman at gmail.com Thu Jun 23 08:51:32 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 23 Jun 2016 22:51:32 +1000
Subject: Re: [scikit-learn] Code review

On 23 June 2016 at 22:47, Raghav R V wrote:
> I always assume it to be implicit advice ;P

I could tell.
From ragvrv at gmail.com Thu Jun 23 12:32:56 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Thu, 23 Jun 2016 18:32:56 +0200
Subject: Re: [scikit-learn] Code review

Regarding the "Needs Review" tag - could I request the maintainers to unlabel a PR once a review has been completed and the PR is waiting on its author? (That should filter out a lot of noise.) The use case I envision for this tag is to serve as a bookmark or a green flag from the maintainer who labels it, so that he can revisit it later, or so that other maintainers who have time can look into that PR. Currently all PRs with `[MRG.*]` are labelled with this (~90), and many of them are waiting for the author to respond.

Also, I feel it would be useful to have a second label (like Manoj suggested in a previous thread) to separate those PRs which need to be reviewed in detail from those which just need a second look (read as: a label used by Maintainer A to signal any other maintainer who can spare a few minutes to take a glance and merge). Maybe "Needs Quick Review" / "Needs 2nd Review"?

The same goes for the "Need Contributors" tag: it should be removed once someone raises a PR. (A lot of new contributors have complained that issues marked "Need Contributors" are already taken. I know I am responsible for 2 such issues :P But I've also asked the commenter to go ahead and raise a PR in both cases.)

From manojkumarsivaraj334 at gmail.com Thu Jun 23 13:08:38 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Thu, 23 Jun 2016 10:08:38 -0700
Subject: Re: [scikit-learn] Welcome Loic Esteve (@lesteve) as a new core contributor

Hi Loic,

Congratulations!

--
Manoj,
http://github.com/MechCoder
From ragvrv at gmail.com Thu Jun 23 15:18:38 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Thu, 23 Jun 2016 21:18:38 +0200
Subject: Re: [scikit-learn] Welcome Loic Esteve (@lesteve) as a new core contributor

Congrats Loïc! Looking forward to your comments :)

From hmf at inesctec.pt Mon Jun 27 06:27:22 2016
From: hmf at inesctec.pt (Hugo Ferreira)
Date: Mon, 27 Jun 2016 11:27:22 +0100
Subject: [scikit-learn] How do we define a distance metric's parameter for grid search
Message-ID: <5770FF8A.4050800@inescporto.pt>

Hello,

I have posted this question on Stack Overflow and did not get an answer. It seems to be a basic usage question, so I am sending it here.

I have the following code snippet that attempts to do a grid search in which one of the grid parameters is the distance metric to be used for the KNN algorithm. The example below fails if I use the "wminkowski", "seuclidean" or "mahalanobis" distance metrics.
# Define the parameter values that should be searched
k_range = range(1, 31)
weights = ['uniform', 'distance']
algos = ['auto', 'ball_tree', 'kd_tree', 'brute']
leaf_sizes = range(10, 60, 10)
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski", "mahalanobis"]

param_grid = dict(n_neighbors=list(k_range), weights=weights, algorithm=algos, leaf_size=list(leaf_sizes), metric=metrics)

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)

# Instantiate the grid
grid = GridSearchCV(knn, param_grid=param_grid, cv=10, scoring='accuracy', n_jobs=-1)

# Fit the models using the grid parameters
grid.fit(X, y)

I assume this is because I have to set or define the ranges for the various distance parameters (for example, p and w for 'wminkowski' - WMinkowskiDistance). The "minkowski" distance may be working because its "p" parameter defaults to 2.

So my questions are:

1. Can we set the range of parameters for the distance metrics for the grid search, and if so, how?
2. Can we set the value of a parameter for the distance metrics for the grid search, and if so, how?

Hope the question is clear.
TIA

From ahowe42 at gmail.com Mon Jun 27 06:59:38 2016
From: ahowe42 at gmail.com (Andrew Howe)
Date: Mon, 27 Jun 2016 13:59:38 +0300
Subject: Re: [scikit-learn] How do we define a distance metric's parameter for grid search

I did something similar where I was using GridSearchCV over different kernel functions for SVM, and not all kernel functions use the same parameters. For example, the degree parameter is only used by the poly kernel.

from sklearn import svm
from sklearn import cross_validation
from sklearn import grid_search

params = [{'kernel':['poly'],'degree':[1,2,3],'gamma':[1/p,1,2],'coef0':[-1,0,1]},\
          {'kernel':['rbf'],'gamma':[1/p,1,2],'degree':[3],'coef0':[0]},\
          {'kernel':['sigmoid'],'gamma':[1/p,1,2],'coef0':[-1,0,1],'degree':[3]}]
GSC = grid_search.GridSearchCV(estimator = svm.SVC(), param_grid = params,\
                               cv = cvrand, n_jobs = -1)

This worked in this instance because the svm.SVC() object only passes parameters to the kernel functions as needed:

[Screenshot scrubbed: image.png, 43248 bytes — the svm.SVC docstring]

Hence, even though my list of dicts includes all three parameters for all types of kernels I used, they were selectively ignored. I'm not sure about parameters for the distance metrics for the KNN object, but it's a good bet it works the same way.

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
Editor-in-Chief, European Journal of Mathematical Sciences
Executive Editor, European Journal of Pure and Applied Mathematics
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
From joel.nothman at gmail.com Mon Jun 27 07:37:36 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 27 Jun 2016 21:37:36 +1000
Subject: Re: [scikit-learn] How do we define a distance metric's parameter for grid search

Hi Hugo,

Andrew's approach -- using a list of dicts to specify multiple parameter grids -- is the correct one.

However, Andrew, you don't need to include parameters that will be ignored in your parameter grid. The following will be effectively the same:

params = [{'kernel':['poly'],'degree':[1,2,3],'gamma':[1/p,1,2],'coef0':[-1,0,1]},
          {'kernel':['rbf'],'gamma':[1/p,1,2]},
          {'kernel':['sigmoid'],'gamma':[1/p,1,2],'coef0':[-1,0,1]}]

Joel
From ahowe42 at gmail.com Mon Jun 27 07:58:52 2016
From: ahowe42 at gmail.com (Andrew Howe)
Date: Mon, 27 Jun 2016 14:58:52 +0300
Subject: Re: [scikit-learn] How do we define a distance metric's parameter for grid search

Yeah, I know :-). I did it like that for a specific reason which I no longer remember :-D. But, you know, it was probably a good one... hahaha

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
Editor-in-Chief, European Journal of Mathematical Sciences
Executive Editor, European Journal of Pure and Applied Mathematics
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
The "minkowski" distance may be working because its >>> "p" parameter has the default 2. >>> >>> So my questions are: >>> >>> 1. Can we set the range of parameters for the distance metrics for the >>> grid search and if so how? >>> 2. Can we set the value of a parameters for the distance metrics for the >>> grid search and if so how? >>> >>> Hope the question is clear. >>> TIA >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 43248 bytes Desc: not available URL: From jaganadhg at gmail.com Mon Jun 27 14:58:53 2016 From: jaganadhg at gmail.com (JAGANADH G) Date: Mon, 27 Jun 2016 11:58:53 -0700 Subject: [scikit-learn] Spherical Kmeans #OT Message-ID: Hi , is there any Python package available for experiment with Sperical Kmeans ? -- ********************************** JAGANADH G http://jaganadhg.in *ILUGCBE* http://ilugcbe.org.in -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Mon Jun 27 15:10:59 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 27 Jun 2016 21:10:59 +0200 Subject: [scikit-learn] Spherical Kmeans #OT In-Reply-To: References: Message-ID: hmm, not an answer, and off the top of my head: if you normalize your data points to l2 norm equal 1, and then use standard kmeans with euclidean distance (which then amounts to 2 - 2 cos(angle between points)) would this be enough for your purposes? (with a bit of luck there may even be some sort of correspondence) Michael On Monday, June 27, 2016, JAGANADH G wrote: > Hi , > is there any Python package available for experiment with Sperical Kmeans ? > > > -- > ********************************** > JAGANADH G > http://jaganadhg.in > *ILUGCBE* > http://ilugcbe.org.in > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fred.mailhot at gmail.com Mon Jun 27 16:03:33 2016 From: fred.mailhot at gmail.com (Fred Mailhot) Date: Mon, 27 Jun 2016 16:03:33 -0400 Subject: [scikit-learn] Spherical Kmeans #OT In-Reply-To: References: Message-ID: Per the example here: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html if your inputs are normalized, sklearn's kmeans behaves like sperical kmeans (unless I'm misunderstanding something, which is certainly possible, caveat lector, &c )... On Jun 27, 2016 12:13 PM, "Michael Eickenberg" wrote: > hmm, not an answer, and off the top of my head: > if you normalize your data points to l2 norm equal 1, and then use > standard kmeans with euclidean distance (which then amounts to 2 - 2 > cos(angle between points)) would this be enough for your purposes? (with a > bit of luck there may even be some sort of correspondence) > > Michael > > On Monday, June 27, 2016, JAGANADH G wrote: > >> Hi , >> is there any Python package available for experiment with Sperical Kmeans >> ? 
From jaganadhg at gmail.com Mon Jun 27 18:28:11 2016
From: jaganadhg at gmail.com (JAGANADH G)
Date: Mon, 27 Jun 2016 15:28:11 -0700
Subject: Re: [scikit-learn] Spherical Kmeans #OT

Hi Fred and Michael,

Thanks for the reply. I think I got this and am able to run it.

Best
Jagan
From michael.eickenberg at gmail.com Mon Jun 27 19:20:08 2016
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Tue, 28 Jun 2016 01:20:08 +0200
Subject: Re: [scikit-learn] Spherical Kmeans #OT

You could do

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans  # (or e.g. MiniBatchKMeans)

spherical_kmeans = make_pipeline(Normalizer(), KMeans(n_clusters=5))
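A quick usage sketch for the pipeline above (X is a placeholder feature matrix; if your version's Pipeline lacks fit_predict, spherical_kmeans.fit(X) followed by spherical_kmeans.predict(X) does the same here):

import numpy as np

X = np.random.randn(100, 20)  # placeholder data
labels = spherical_kmeans.fit_predict(X)  # one cluster id per row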
From joel.nothman at gmail.com Mon Jun 27 20:09:19 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 28 Jun 2016 10:09:19 +1000
Subject: Re: [scikit-learn] Spherical Kmeans #OT

(Since Normalizer is applied to each sample independently, the Pipeline/Transformer mechanism doesn't actually provide any benefit over sklearn.preprocessing.normalize.)
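The function version of the same thing, following Joel's point (X again a placeholder feature matrix):

from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=5).fit_predict(normalize(X))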
From michael.eickenberg at gmail.com Tue Jun 28 02:09:51 2016
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Tue, 28 Jun 2016 08:09:51 +0200
Subject: Re: [scikit-learn] Spherical Kmeans #OT

Well, true :) But you can put it in pipelines! :)

(So, by that logic, is there any reason for keeping it in the package?)
From hmf at inesctec.pt Tue Jun 28 05:30:29 2016
From: hmf at inesctec.pt (Hugo Ferreira)
Date: Tue, 28 Jun 2016 10:30:29 +0100
Subject: [scikit-learn] How do we define a distance metric's parameter for grid search
In-Reply-To:
References: <5770FF8A.4050800@inescporto.pt>
Message-ID: <577243B5.1040301@inescporto.pt>

Hi Andrew and Joel.

I am going to give this a go.

Thanks,
Hugo

On 27-06-2016 12:37, Joel Nothman wrote:
> Hi Hugo,
>
> Andrew's approach -- using a list of dicts to specify multiple parameter
> grids -- is the correct one.
>
> However, Andrew, you don't need to include parameters that will be
> ignored in your parameter grid. The following will be effectively the
> same:
>
> params = [{'kernel': ['poly'], 'degree': [1, 2, 3], 'gamma': [1/p, 1, 2], 'coef0': [-1, 0, 1]},
>           {'kernel': ['rbf'], 'gamma': [1/p, 1, 2]},
>           {'kernel': ['sigmoid'], 'gamma': [1/p, 1, 2], 'coef0': [-1, 0, 1]}]
>
> Joel
>
> On 27 June 2016 at 20:59, Andrew Howe wrote:
>
> I did something similar where I was using GridSearchCV over different
> kernel functions for SVM, and not all kernel functions use the same
> parameters. For example, the *degree* parameter is only used by the
> *poly* kernel.
>
> from sklearn import svm
> from sklearn import cross_validation
> from sklearn import grid_search
>
> params = [{'kernel': ['poly'], 'degree': [1, 2, 3], 'gamma': [1/p, 1, 2], 'coef0': [-1, 0, 1]},
>           {'kernel': ['rbf'], 'gamma': [1/p, 1, 2], 'degree': [3], 'coef0': [0]},
>           {'kernel': ['sigmoid'], 'gamma': [1/p, 1, 2], 'coef0': [-1, 0, 1], 'degree': [3]}]
> GSC = grid_search.GridSearchCV(estimator=svm.SVC(), param_grid=params,
>                                cv=cvrand, n_jobs=-1)
>
> This worked in this instance because the svm.SVC() object only passes
> parameters to the kernel functions as needed. Hence, even though my list
> of dicts includes all three parameters for all types of kernels I used,
> they were selectively ignored. I'm not sure about parameters for the
> distance metrics for the KNN object, but it's a good bet it works the
> same way.
>
> Andrew
>
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> J. Andrew Howe, PhD
> Editor-in-Chief, European Journal of Mathematical Sciences
> Executive Editor, European Journal of Pure and Applied Mathematics
> www.andrewhowe.com
> http://www.linkedin.com/in/ahowe42
> https://www.researchgate.net/profile/John_Howe12/
> I live to learn, so I can learn to live. - me
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>
> On Mon, Jun 27, 2016 at 1:27 PM, Hugo Ferreira wrote:
>
> Hello,
>
> I have posted this question on Stack Overflow and did not get an answer.
> It seems to be a basic usage question, so I am sending it here.
>
> I have the following code snippet that attempts to do a grid search in
> which one of the grid parameters is the distance metric to be used for
> the KNN algorithm. The example below fails if I use "wminkowski",
> "seuclidean" or "mahalanobis" as the distance metric.
>
> # Define the parameter values that should be searched
> k_range = range(1, 31)
> weights = ['uniform', 'distance']
> algos = ['auto', 'ball_tree', 'kd_tree', 'brute']
> leaf_sizes = range(10, 60, 10)
> metrics = ["euclidean", "manhattan", "chebyshev", "minkowski",
>            "mahalanobis"]
>
> param_grid = dict(n_neighbors=list(k_range), weights=weights,
>                   algorithm=algos, leaf_size=list(leaf_sizes),
>                   metric=metrics)
> param_grid
>
> # Instantiate the algorithm
> knn = KNeighborsClassifier(n_neighbors=10)
>
> # Instantiate the grid
> grid = GridSearchCV(knn, param_grid=param_grid, cv=10,
>                     scoring='accuracy', n_jobs=-1)
>
> # Fit the models using the grid parameters
> grid.fit(X, y)
>
> I assume this is because I have to set or define the ranges for the
> various distance parameters (for example p and w for 'wminkowski' -
> WMinkowskiDistance). The "minkowski" distance may be working because its
> "p" parameter has the default 2.
>
> So my questions are:
>
> 1. Can we set the range of parameters of the distance metrics for the
> grid search, and if so how?
> 2. Can we set the value of a parameter of the distance metrics for the
> grid search, and if so how?
>
> Hope the question is clear.
> TIA
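(A self-contained sketch of Joel's trimmed list-of-dicts grid, not from the
thread. It assumes p stands for the number of features, by analogy with
SVC's old 1/n_features default for gamma, and uses the iris data purely for
illustration; grid_search is the module name of this scikit-learn era,
later replaced by model_selection.)

from sklearn import datasets, svm, grid_search

iris = datasets.load_iris()
X, y = iris.data, iris.target
p = X.shape[1]  # assumed meaning of p: number of features

params = [
    {'kernel': ['poly'], 'degree': [1, 2, 3], 'gamma': [1. / p, 1, 2], 'coef0': [-1, 0, 1]},
    {'kernel': ['rbf'], 'gamma': [1. / p, 1, 2]},
    {'kernel': ['sigmoid'], 'gamma': [1. / p, 1, 2], 'coef0': [-1, 0, 1]},
]
gsc = grid_search.GridSearchCV(svm.SVC(), param_grid=params, cv=5)
gsc.fit(X, y)
print(gsc.best_params_)  # each dict is expanded into its own sub-grid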
From hmf at inesctec.pt Tue Jun 28 07:03:11 2016
From: hmf at inesctec.pt (Hugo Ferreira)
Date: Tue, 28 Jun 2016 12:03:11 +0100
Subject: [scikit-learn] How do we define a distance metric's parameter for grid search
In-Reply-To:
References: <5770FF8A.4050800@inescporto.pt>
Message-ID: <5772596F.4050603@inescporto.pt>

Hello,

On 27-06-2016 12:37, Joel Nothman wrote:
> Hi Hugo,
>
> Andrew's approach -- using a list of dicts to specify multiple parameter
> grids -- is the correct one.
> [...]

I tried to do this but am having errors. It seems I need to use the
'metric_params' parameter, but I cannot get it right.
Here are some of the attempts I made:

{'metric': ['wminkowski'],
 'metric_params': [{'w': [0.01, 0.1, 1, 10, 100], 'p': [1, 2, 3, 4, 5]}],
 'n_neighbors': list(k_range), 'weights': weights, 'algorithm': algos,
 'leaf_size': list(leaf_sizes)}

{'metric': ['wminkowski'], 'metric_params': [{'w': 0.01, 'p': 1}],
 'n_neighbors': list(k_range), 'weights': weights, 'algorithm': algos,
 'leaf_size': list(leaf_sizes)}

{'metric': ['wminkowski'], 'metric_params': [dict(w=0.01, p=1)],
 'n_neighbors': list(k_range), 'weights': weights, 'algorithm': algos,
 'leaf_size': list(leaf_sizes)}

The last two give me the following error:

Exception ignored in: 'sklearn.neighbors.dist_metrics.get_vec_ptr'
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

Can anyone see what I am doing wrong?

TIA,

> Joel
>
> On 27 June 2016 at 20:59, Andrew Howe wrote:
> [...]
From joel.nothman at gmail.com Tue Jun 28 07:45:17 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 28 Jun 2016 21:45:17 +1000
Subject: [scikit-learn] How do we define a distance metric's parameter for grid search
In-Reply-To: <5772596F.4050603@inescporto.pt>
References: <5770FF8A.4050800@inescporto.pt> <5772596F.4050603@inescporto.pt>
Message-ID:

> I tried to do this but am having errors. It seems I need to use the
> 'metric_params' parameter, but I cannot get it right. Here are some of
> the attempts I made:
>
> {'metric': ['wminkowski'],
>  'metric_params': [{'w': [0.01, 0.1, 1, 10, 100], 'p': [1, 2, 3, 4, 5]}],
>  'n_neighbors': list(k_range), 'weights': weights, 'algorithm': algos,
>  'leaf_size': list(leaf_sizes)}
>
> {'metric': ['wminkowski'], 'metric_params': [{'w': 0.01, 'p': 1}],
>  'n_neighbors': list(k_range), 'weights': weights, 'algorithm': algos,
>  'leaf_size': list(leaf_sizes)}
>
> {'metric': ['wminkowski'], 'metric_params': [dict(w=0.01, p=1)],
>  'n_neighbors': list(k_range), 'weights': weights, 'algorithm': algos,
>  'leaf_size': list(leaf_sizes)}
>
> The last two give me the following error:
>
> Exception ignored in: 'sklearn.neighbors.dist_metrics.get_vec_ptr'
> ValueError: Buffer has wrong number of dimensions (expected 1, got 0)
>
> Can anyone see what I am doing wrong?

I can see *something* you're doing wrong. Firstly, your second and third
examples produce identical Python objects.

But in metric_params, p should be an integer and w should be a
1-dimensional array. In your first example, both p and w will be 1d, and in
your second and third, both are scalars. You want something like ...

'metric_params': [{'w': [0.01, 0.1, 1, 10, 100], 'p': 1}]

... except that those values for 'w' seem a bit strange for weights (are
you sure you want wminkowski?).
You can try multiple 'p' with

'metric_params': [{'w': weights, 'p': 1}, {'w': weights, 'p': 2},
                  {'w': weights, 'p': 3}, ...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hmf at inesctec.pt Tue Jun 28 08:52:16 2016
From: hmf at inesctec.pt (Hugo Ferreira)
Date: Tue, 28 Jun 2016 13:52:16 +0100
Subject: [scikit-learn] How do we define a distance metric's parameter for grid search
In-Reply-To:
References: <5770FF8A.4050800@inescporto.pt> <5772596F.4050603@inescporto.pt>
Message-ID: <57727300.1010500@inescporto.pt>

Hi,

On 28-06-2016 12:45, Joel Nothman wrote:
> I can see *something* you're doing wrong. Firstly, your second and third
> examples produce identical Python objects.

Yeah. It's called desperation :-)

> But in metric_params, p should be an integer and w should be a
> 1-dimensional array. In your first example, both p and w will be 1d, and
> in your second and third, both are scalars. You want something like ...
> 'metric_params': [{'w': [0.01, 0.1, 1, 10, 100], 'p': 1}] ... except
> that those values for 'w' seem a bit strange for weights (are you sure
> you want wminkowski?).

Just testing the code. I'll need to learn what values are the most
appropriate here. Are these the weights to be applied to each feature
(number of weights = number of features)? I wonder how I can use this
during feature selection.

I have used the simplest case and set of parameters as follows (before
attempting multiple parameters as you have shown above):

param_grid = [{'metric': ['wminkowski'],
               'metric_params': [{'w': [10, 20], 'p': 1}]}]

and I get the error:

File "", line unknown
SyntaxError: invalid or missing encoding declaration for
'/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/ball_tree.cpython-34m.so'

OK, so this may be due to the specific type of tree being used. I then set
the parameters to:

{'metric': ['wminkowski'], 'metric_params': [{'w': [10.0, 20.0], 'p': 1}],
 'algorithm': algos}

where algos is:

algos = ['brute']

This results in the following error:

AttributeError: 'list' object has no attribute 'dtype'

So it seems we need to use an array explicitly. The following will work:

{'metric': ['wminkowski'],
 'metric_params': [{'w': np.array([10.0, 20.0]), 'p': 1}],
 'algorithm': algos}

Thanks for the help.

Hugo
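(Pulling the thread's resolution together, a minimal end-to-end sketch that
is not from the thread: w is passed as a 1-d numpy array of length
n_features with brute-force search, as Hugo found above. The toy data, w
values, and grid are illustrative, and grid_search is the module name of
this scikit-learn era; newer scikit-learn/SciPy versions may no longer
accept 'wminkowski'.)

import numpy as np
from sklearn import datasets, grid_search
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target  # 4 features, so w needs 4 entries

param_grid = [{
    'metric': ['wminkowski'],
    # one dict per (w, p) combination; w must be a 1-d numpy array, not a list
    'metric_params': [{'w': np.ones(4), 'p': 1},
                      {'w': np.ones(4), 'p': 2}],
    'n_neighbors': [1, 5, 10],
    'algorithm': ['brute'],
}]
grid = grid_search.GridSearchCV(KNeighborsClassifier(), param_grid=param_grid,
                                cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)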
From dbsullivan23 at gmail.com Wed Jun 29 06:33:33 2016
From: dbsullivan23 at gmail.com (Daniel Sullivan)
Date: Wed, 29 Jun 2016 12:33:33 +0200
Subject: [scikit-learn] [Scikit-learn-general] Gradient Descent
In-Reply-To:
References:
Message-ID:

(Sent to wrong mailing list, sorry for duplication)

Hi Chaitanya,

Yes, the Stochastic Gradient Descent algorithm logic is written in Cython.
The implementation can be viewed here:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx

Hope that helps,

Danny

On Wed, Jun 29, 2016 at 10:31 AM, Chaitanya Prasad wrote:

> Hello
>
> I'm a student currently trying to benchmark a few black-box optimization
> algorithms against gradient descent algorithms. It would be extremely
> helpful for me if someone could tell me whether the Stochastic Gradient
> Descent in scikit-learn has been written in pure Python or whether it has
> been optimized using Cython in any manner.
>
> Thanks and Regards.
> Chaitanya
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
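(One way to verify Danny's answer locally, as a sketch under the assumption
that the module name sgd_fast of this scikit-learn era is still importable
on your install: a compiled Cython extension reports a .so file, whereas a
pure-Python module would report a .py file.)

from sklearn.linear_model import sgd_fast

# prints something like .../sklearn/linear_model/sgd_fast.so,
# i.e. a compiled extension module, not pure Python
print(sgd_fast.__file__)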
From basilbeirouti at gmail.com Thu Jun 30 18:23:18 2016
From: basilbeirouti at gmail.com (Basil Beirouti)
Date: Thu, 30 Jun 2016 17:23:18 -0500
Subject: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)
Message-ID:

Hello everyone,

I have successfully created a few versions of the BM25Transformer. I looked
at TfidfTransformer for guidance and noticed that it outputs a sparse
matrix when given a sparse term-count matrix as input.

Unfortunately, the fastest implementation of BM25Transformer that I have
been able to come up with does NOT output a sparse matrix; it returns a
regular numpy matrix.

Benchmarked against the entire 20newsgroups corpus, here is how the
versions perform (assuming the input is a csr_matrix for all):

1.) finishes in 4 seconds, outputs a regular numpy matrix
2.) finishes in 30 seconds, outputs a dok_matrix
3.) finishes in 130 seconds, outputs a regular numpy matrix

It's worth noting that using algorithm 1 and converting the output to a
sparse matrix still takes less time than 3, and takes about as long as 2.

So my question is, how important is it that my BM25Transformer outputs a
sparse matrix?

I'm going to try another implementation which looks directly at the data,
indices, and indptr attributes of the inputted csr_matrix. I just wanted to
check in and see what people thought.

Sincerely,
Basil Beirouti

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com Thu Jun 30 18:38:15 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 1 Jul 2016 08:38:15 +1000
Subject: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)
In-Reply-To:
References:
Message-ID:

I don't see what about BM25, at least as presented at
https://en.wikipedia.org/wiki/Okapi_BM25, should prevent using CSR
operations efficiently. Show us your code.

On 1 July 2016 at 08:23, Basil Beirouti wrote:

> Hello everyone,
>
> I have successfully created a few versions of the BM25Transformer.
> [...]
>
> Sincerely,
> Basil Beirouti

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
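(To illustrate Joel's point, a sketch that is not from the thread and not
Basil's code: the Okapi BM25 weighting from the Wikipedia page above, with
assumed defaults k1 = 1.5 and b = 0.75, computed entirely on the CSR
arrays so the result stays sparse.)

import numpy as np
import scipy.sparse as sp

def bm25_weight(X, k1=1.5, b=0.75):
    """Apply Okapi BM25 weighting to a CSR matrix of raw term counts."""
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    n_docs, n_terms = X.shape

    # idf(t) = log((N - df + 0.5) / (df + 0.5)), with df = document frequency
    df = np.bincount(X.indices, minlength=n_terms)
    idf = np.log((n_docs - df + 0.5) / (df + 0.5))

    # length normalisation: k1 * (1 - b + b * |d| / avgdl), one value per document
    doc_len = np.asarray(X.sum(axis=1)).ravel()
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())

    # map each stored nonzero back to its row, then reweight X.data in place
    rows = np.repeat(np.arange(n_docs), np.diff(X.indptr))
    tf = X.data
    X.data = idf[X.indices] * tf * (k1 + 1.0) / (tf + norm[rows])
    return X  # still a scipy.sparse.csr_matrix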
From mail at sebastianraschka.com Thu Jun 30 18:33:49 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Thu, 30 Jun 2016 18:33:49 -0400
Subject: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)
In-Reply-To:
References:
Message-ID: <6411ECB7-BD7C-4960-B847-B3D633DD848A@sebastianraschka.com>

Hi, Basil,

I'd say runtime may not be the main concern regarding sparse vs. dense. In
my opinion, the main reason to use sparse arrays is memory usage: text data
is typically rather large (high-dimensional, sparse feature vectors). So
one limitation with scikit-learn is typically memory capacity, especially
if you are using multiprocessing via the cv param.

PS:

> regular numpy matrix

I think you mean "numpy array"? (There is a numpy matrix data structure in
numpy as well; however, almost no one uses it.)

Best,
Sebastian

> On Jun 30, 2016, at 6:23 PM, Basil Beirouti wrote:
>
> Hello everyone,
>
> I have successfully created a few versions of the BM25Transformer.
> [...]
>
> Sincerely,
> Basil Beirouti
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
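(A rough way to see Sebastian's memory argument, as an illustrative sketch
using the 20newsgroups corpus Basil benchmarked against; the exact numbers
will vary with scikit-learn version and vocabulary size.)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset='all').data
X = CountVectorizer().fit_transform(docs)  # CSR matrix of term counts

# CSR stores only the nonzeros; a dense array would store every cell
sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6
dense_mb = X.shape[0] * X.shape[1] * X.dtype.itemsize / 1e6
print("CSR: %.0f MB, dense equivalent: %.0f MB" % (sparse_mb, dense_mb))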