From dmoisset at machinalis.com Mon Aug 1 12:15:14 2016 From: dmoisset at machinalis.com (Daniel Moisset) Date: Mon, 1 Aug 2016 17:15:14 +0100 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: <20160729195718.GO787902@phare.normalesup.org> References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: On Fri, Jul 29, 2016 at 8:57 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > > Can you summarize once again in very simple terms what would be the big > benefits? > Benefits for regular scikit-learn users 1. Reliable information on method signatures in a standardized way ("reliable" in the sense of "automatically verified") 2. Better integration with tools supporting PEP-484 (editors, documentation tools). This is a small set now, but I expect it to grow (and it's also a chicken-and-egg problem, support has to start somewhere) Benefits for scikit-learn users also using mypy and/or PEP-484 (probably not a large set, but I know a few people :) ) 0. Same as the rest of the users 1. Early detection of errors in own code while writing code based on SKL 2. Making own code more readable/explicit by annotating functions that receive/return SKL types (and verifying those annotations) Benefits for scikit-learn developers 1. Some extra checks that changes keep internal consistency 2. (Future) possible simplification of typing information in docstrings, which would become redundant (this would require updating doc generators) Regarding the cost for contributing, a scenario where you get a CI error due to mypy would be because: * the change in the code somewhat changed the existing accepted/returned types, which is a change in the API and should actually be verified * the change in the code extended the signature of an existing function (what Andreas mentioned); in this situation it's similar to a PR that adds an argument and doesn't update the docstring (only that this is automatically caught). WRT the second issue, the error here might be confusing when using the "one line" syntax because arguments may "misalign" with their signatures. The multiline version (or the python3-only form) is safer in that sense (in fact, adding an argument there will not produce a CI problem because it's unannotated and assumed to be "any type"). Adding new modules/methods without annotations wouldn't produce an error, just an incompleteness in the annotations. A possible source of problems like the one you mention is that the implementation of the annotated methods will be checked, and sometimes you'll get a warning about a local variable if mypy can't infer its type (it happens sometimes when assigning an empty list to a local, where mypy knows that it's a list but doesn't know the element type). But in that case I think the message you get is very obvious. -- Daniel F. Moisset - UK Country Manager www.machinalis.com Skype: @dmoisset -------------- next part -------------- An HTML attachment was scrubbed... URL:
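For illustration, a minimal sketch of the two comment-based annotation styles discussed above (the function and its types are hypothetical, not an actual scikit-learn signature), checkable by mypy on both Python 2.7 and 3.x:

    from typing import List, Optional

    # "One line" style: the whole signature lives in a single type comment, so a
    # newly added argument can silently misalign with the listed types.
    def fit_transform(X, y=None, copy=True):
        # type: (List[List[float]], Optional[List[int]], bool) -> List[List[float]]
        return X

    # Multiline / per-argument style: each parameter carries its own comment, and a
    # parameter added without a comment is simply treated as Any by mypy.
    def fit_transform_safer(X,          # type: List[List[float]]
                            y=None,     # type: Optional[List[int]]
                            copy=True,  # type: bool
                            ):
        # type: (...) -> List[List[float]]
        return X
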
From luizfgoncalves at dcc.ufmg.br Mon Aug 1 15:55:27 2016 From: luizfgoncalves at dcc.ufmg.br (luizfgoncalves at dcc.ufmg.br) Date: Mon, 1 Aug 2016 16:55:27 -0300 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes Message-ID: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> I'm looking for the best way to install sklearn into a specific folder so I can make changes for my work, without worrying about bugging my main sklearn installation (as I use the default version for some experiments too). I tried to clone the git repository and use "python setup.py install", but I'm afraid it will change my user installation too. Right now, what I want is to edit a file called splitter.pyx (on tree folder), compile/install sklearn so it will work with my changes, and test it. What is the best way to do it without causing problems with my main sklearn installation? Thanks a lot for your attention From t3kcit at gmail.com Mon Aug 1 16:08:44 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 1 Aug 2016 16:08:44 -0400 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> Message-ID: <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> Hi. The best is probably to use a virtual environment or conda environment specific for this changed version of scikit-learn. In that environment you could just run an "install" and it would not mess with your other environments. If you don't want to go that way, you can also do ``python setup.py build_ext -i`` to build inplace and then add this path to your python path (PYTHONPATH environment variable or sys.path.insert in the script or many other ways). Best, Andy On 08/01/2016 03:55 PM, luizfgoncalves at dcc.ufmg.br wrote: > I'm looking for the best way to install sklearn into a specific folder so > I can make changes for my work, without worrying about bugging my main > sklearn installation (as I use the default version for some experiments > too). > > I tried to clone the git repository and use "python setup.py install", but > I'm afraid it will change my user installation too. > > Right now, what I want is to edit a file called splitter.pyx (on tree > folder), compile/install sklearn so it will work with my changes, and test > it. > > What is the best way to do it without causing problems with my main > sklearn installation? > > Thanks a lot for your attention > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From michael.eickenberg at gmail.com Mon Aug 1 16:15:30 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 1 Aug 2016 22:15:30 +0200 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> Message-ID: There are several ways of achieving this. One is to build scikit-learn in place by going into the sklearn clone and typing "make in", or alternatively python setup.py build_ext --inplace # (i think) Then you can use the environment variable PYTHONPATH, set to the github clone, and python will give precedence to the clone whenever the variable is set.
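As a rough sketch of that PYTHONPATH/sys.path route (the clone path below is only a placeholder), after an in-place build you can make Python prefer the checkout over the installed package:

    import sys

    # Put the in-place-built clone ahead of any installed scikit-learn;
    # equivalent to exporting PYTHONPATH=/path/to/scikit-learn before starting Python.
    sys.path.insert(0, "/path/to/scikit-learn")

    import sklearn
    print(sklearn.__version__)  # should report the development version
    print(sklearn.__file__)     # should point into the clone, not site-packages
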
As an alternative, you can install your clone using python setup.py develop and then work on a branch. Checkout master and rebuild whenever you need it. This would entail working on the same clone for master and development (so your builtin default sklearn would be overrridden) hth, Michael On Monday, August 1, 2016, wrote: > I'm looking for the best way to install sklearn into a specific folder so > I can make changes for my work, without worrying about bugging my main > sklearn installation (as I use the default version for some experiments > too). > > I tried to clone the git repository and use "python setup.py install", but > I'm afraid it will change my user installation too. > > Right now, what I want is to edit a file called splitter.pyx (on tree > folder), compile/install sklearn so it will work with my changes, and test > it. > > What is the best way to do it without causing problems with my main > sklearn installation? > > Thanks a lot for your attention > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Mon Aug 1 16:09:39 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 1 Aug 2016 16:09:39 -0400 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> Message-ID: Hi, I would highly recommend you to work with virtual environments here. E.g., look into Anaconda/Miniconda (http://conda.pydata.org/miniconda.html, http://conda.pydata.org/docs/using/using.html), which makes this process most convenient in my opinion. Alternatively, I would use Python?s virtualenv (http://docs.python-guide.org/en/latest/dev/virtualenvs/). Best, Sebastian > On Aug 1, 2016, at 3:55 PM, luizfgoncalves at dcc.ufmg.br wrote: > > I'm looking for the best way to install sklearn into a specific folder so > I can make changes for my work, without worrying about bugging my main > sklearn installation (as I use the default version for some experiments > too). > > I tried to clone the git repository and use "python setup.py install", but > I'm afraid it will change my user installation too. > > Right now, what I want is to edit a file called splitter.pyx (on tree > folder), compile/install sklearn so it will work with my changes, and test > it. > > What is the best way to do it without causing problems with my main > sklearn installation? > > Thanks a lot for your attention > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From michael.eickenberg at gmail.com Mon Aug 1 16:17:08 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 1 Aug 2016 22:17:08 +0200 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> Message-ID: On Monday, August 1, 2016, Andreas Mueller wrote: > Hi. > The best is probably to use a virtual environment or conda environment > specific for this changed version of scikit-learn. 
> In that environment you could just run an "install" and it would not mess > with your other environments. +1! > If you don't want to go that way, you can also do ``python setup.py > build_ext -i`` to build inplace and then add this > path to your python path (PYTONPATH environment variable or > sys.path.insert in the script or many other ways). > > Best, > Andy > > On 08/01/2016 03:55 PM, luizfgoncalves at dcc.ufmg.br wrote: > >> I'm looking for the best way to install sklearn into a specific folder so >> I can make changes for my work, without worrying about bugging my main >> sklearn installation (as I use the default version for some experiments >> too). >> >> I tried to clone the git repository and use "python setup.py install", but >> I'm afraid it will change my user installation too. >> >> Right now, what I want is to edit a file called splitter.pyx (on tree >> folder), compile/install sklearn so it will work with my changes, and test >> it. >> >> What is the best way to do it without causing problems with my main >> sklearn installation? >> >> Thanks a lot for your attention >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Tue Aug 2 08:34:34 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Tue, 2 Aug 2016 12:34:34 +0000 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> Message-ID: I agree with everyone else ? conda environments are specially designed for this situation. I?ve not used virtualenv myself (http://docs.python-guide.org/en/latest/dev/virtualenvs/). I?m an Anaconda user. __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Michael Eickenberg Sent: Monday, August 1, 2016 4:17 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Install sklearn into a specific folder to make some changes ? EXT MSG: On Monday, August 1, 2016, Andreas Mueller > wrote: Hi. The best is probably to use a virtual environment or conda environment specific for this changed version of scikit-learn. In that environment you could just run an "install" and it would not mess with your other environments. +1! If you don't want to go that way, you can also do ``python setup.py build_ext -i`` to build inplace and then add this path to your python path (PYTONPATH environment variable or sys.path.insert in the script or many other ways). Best, Andy On 08/01/2016 03:55 PM, luizfgoncalves at dcc.ufmg.br wrote: I'm looking for the best way to install sklearn into a specific folder so I can make changes for my work, without worrying about bugging my main sklearn installation (as I use the default version for some experiments too). 
I tried to clone the git repository and use "python setup.py install", but I'm afraid it will change my user installation too. Right now, what I want is to edit a file called splitter.pyx (on tree folder), compile/install sklearn so it will work with my changes, and test it. What is the best way to do it without causing problems with my main sklearn installation? Thanks a lot for your attention _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmoisset at machinalis.com Tue Aug 2 09:34:17 2016 From: dmoisset at machinalis.com (Daniel Moisset) Date: Tue, 2 Aug 2016 14:34:17 +0100 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: A couple of things I forgot to mention: * One relevant consequence is that, to add annotations on the code, scikit-learn should depend on the "typing"[1] module which contains some of the basic names imported and used in annotations. It's a stdlib module in python 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not sure how it works with Python 2.6, which might be an issue) * As an example of the kind of bugs that mypy can find, someone here already found a documentation bug in the sklearn.svm.SVC() initializer; the "kernel" parameter is described as "string"[2], when it's actually a "string or callable" (which can be read in the "small print" description of the argument). That kind of slips would be automatically prevented if declared as an annotation with mypy on the CI. Also it would be more clear what is the signature of the callable directly instead of looking up additional documentation on kernel functions or digging into the source [1] https://pypi.python.org/pypi/typing [2] http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC On Mon, Aug 1, 2016 at 5:15 PM, Daniel Moisset wrote: > On Fri, Jul 29, 2016 at 8:57 PM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> >> Can you summarize once again in very simple terms what would be the big >> benefits? >> > > Benefits for regular scikit-learn users > > 1. Reliable information on method signatures in a standarized way > ("reliable" in the sense of "automatically verified") > 2. Better integration with tools supporting PEP-484 (editors, > documentation tools). This is a small set now, but I expect it to grow (and > it's also an egg and chicken problem, support has to start somewhere) > > Benefits for scikit-learn users also using mypy and/or PEP-484 (probably > not a large set, but I know a few people :) ) > > 0. Same as the rest of the users > 1. Early detection of errors in own code while writing code based on SKL > 2. Making own code more readable/explicit by annotating functions that > receive/return SKL types (and verifying that annotations) > > Benefits for scikit-learn developers > > 1. Some extra checks that changes keep internal consistency > 2. 
(Future) possible simplification of typing information in docstrings, > which would make themselves redundant (this would require updating doc > generators) > > Regarding the cost for contributing, an scenario where you get a CI error > due to mypy would be because: > > * the change in the code somewhat changed the existing accepted/returned > types, which is a change in the API and should actually be verified > * the change in the code extended the signature of an existing function > (what Andreas mentioned); in this situation it's similar to a PR that adds > an argument and doesn't update the docstring (only that this is > automatically caught). > > WRT to the second issue, the error here might be confusing when using the > "one line" syntax because arguments may "misalign" with their signatures. > The multiline version (or the python3-only form) is safer in that sense (in > fact, adding an argument there will not produce a CI problem because its > unannotated and assumed to be "any type"). > > Adding new modules/methods without no annotations wouldn't produce an > error, just an incompleteness in the annotations > > A possible source of problems like the one you mention is that the > implementation of the annotated methods will be checked, and sometimes > you'll get a warning about a local variable if mypy can't infer its type > (it happens sometimes when assigning an empty list to a local, where mypy > knows that it's a list but doesn't know the element type). But in that case > I think the message you get is very obvious. > > -- > Daniel F. Moisset - UK Country Manager > www.machinalis.com > Skype: @dmoisset > -- Daniel F. Moisset - UK Country Manager www.machinalis.com Skype: @dmoisset -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Aug 2 10:06:02 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 3 Aug 2016 00:06:02 +1000 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: I certainly see the benefit, and think we would benefit also from finding test coverage holes wrt input type. But I think without ndarray/sparse matrix type support, we're not going to be able to annotate most of our code in sufficient detail. On 2 August 2016 at 23:34, Daniel Moisset wrote: > A couple of things I forgot to mention: > > * One relevant consequence is that, to add annotations on the code, > scikit-learn should depend on the "typing"[1] module which contains some of > the basic names imported and used in annotations. It's a stdlib module in > python 3.5, but the PyPI package backports it to python 2.7 and newer (I'm > not sure how it works with Python 2.6, which might be an issue) > * As an example of the kind of bugs that mypy can find, someone here > already found a documentation bug in the sklearn.svm.SVC() initializer; the > "kernel" parameter is described as "string"[2], when it's actually a > "string or callable" (which can be read in the "small print" description of > the argument). That kind of slips would be automatically prevented if > declared as an annotation with mypy on the CI. 
Also it would be more clear > what is the signature of the callable directly instead of looking up > additional documentation on kernel functions or digging into the source > > [1] https://pypi.python.org/pypi/typing > [2] > http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC > > > On Mon, Aug 1, 2016 at 5:15 PM, Daniel Moisset > wrote: > >> On Fri, Jul 29, 2016 at 8:57 PM, Gael Varoquaux < >> gael.varoquaux at normalesup.org> wrote: >> >>> >>> Can you summarize once again in very simple terms what would be the big >>> benefits? >>> >> >> Benefits for regular scikit-learn users >> >> 1. Reliable information on method signatures in a standarized way >> ("reliable" in the sense of "automatically verified") >> 2. Better integration with tools supporting PEP-484 (editors, >> documentation tools). This is a small set now, but I expect it to grow (and >> it's also an egg and chicken problem, support has to start somewhere) >> >> Benefits for scikit-learn users also using mypy and/or PEP-484 (probably >> not a large set, but I know a few people :) ) >> >> 0. Same as the rest of the users >> 1. Early detection of errors in own code while writing code based on SKL >> 2. Making own code more readable/explicit by annotating functions that >> receive/return SKL types (and verifying that annotations) >> >> Benefits for scikit-learn developers >> >> 1. Some extra checks that changes keep internal consistency >> 2. (Future) possible simplification of typing information in docstrings, >> which would make themselves redundant (this would require updating doc >> generators) >> >> Regarding the cost for contributing, an scenario where you get a CI error >> due to mypy would be because: >> >> * the change in the code somewhat changed the existing accepted/returned >> types, which is a change in the API and should actually be verified >> * the change in the code extended the signature of an existing function >> (what Andreas mentioned); in this situation it's similar to a PR that adds >> an argument and doesn't update the docstring (only that this is >> automatically caught). >> >> WRT to the second issue, the error here might be confusing when using the >> "one line" syntax because arguments may "misalign" with their signatures. >> The multiline version (or the python3-only form) is safer in that sense (in >> fact, adding an argument there will not produce a CI problem because its >> unannotated and assumed to be "any type"). >> >> Adding new modules/methods without no annotations wouldn't produce an >> error, just an incompleteness in the annotations >> >> A possible source of problems like the one you mention is that the >> implementation of the annotated methods will be checked, and sometimes >> you'll get a warning about a local variable if mypy can't infer its type >> (it happens sometimes when assigning an empty list to a local, where mypy >> knows that it's a list but doesn't know the element type). But in that case >> I think the message you get is very obvious. >> >> -- >> Daniel F. Moisset - UK Country Manager >> www.machinalis.com >> Skype: @dmoisset >> > > > > -- > Daniel F. Moisset - UK Country Manager > www.machinalis.com > Skype: @dmoisset > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Tue Aug 2 13:48:22 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 2 Aug 2016 19:48:22 +0200 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: <20160802174822.GD1269350@phare.normalesup.org> > * One relevant consequence is that, to add annotations on the code, > scikit-learn should depend on the "typing"[1] module which contains some of the > basic names imported and used in annotations. It's a stdlib module in python > 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not sure > how it works with Python 2.6, which might be an issue) I am afraid that this is going to be a problem: we have a no dependency policy (beyond numpy and scipy). From t3kcit at gmail.com Tue Aug 2 14:12:17 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 2 Aug 2016 14:12:17 -0400 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: <20160802174822.GD1269350@phare.normalesup.org> References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> <20160802174822.GD1269350@phare.normalesup.org> Message-ID: On 08/02/2016 01:48 PM, Gael Varoquaux wrote: >> * One relevant consequence is that, to add annotations on the code, >> scikit-learn should depend on the "typing"[1] module which contains some of the >> basic names imported and used in annotations. It's a stdlib module in python >> 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not sure >> how it works with Python 2.6, which might be an issue) > I am afraid that this is going to be a problem: we have a no dependency > policy (beyond numpy and scipy). I still think this is a point we should discuss further ;) From shee.yu at gmail.com Tue Aug 2 17:02:23 2016 From: shee.yu at gmail.com (Shi Yu) Date: Tue, 2 Aug 2016 16:02:23 -0500 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 Message-ID: Hello, We trained SVM models in scikit-learn 0.17 and saved it as pickle files. When loading the models back in a lower version of scikit-learn 0.15, the outputs are entirely different. Basically for binary classification problem, for the same test data, it swapped the probabilities and gave an opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] and the prediction is 1. In 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. I wonder is anyone seeing the same issue, or it has been notified. I could provide more details for error replication if required. Best, Shi -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 3 14:29:08 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Aug 2016 14:29:08 -0400 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: Hi Shi. In general, there is no guarantee that models built with one version will work in a different version. In particular, loading in an older version when built in a newer version seems something that's tricky to achieve. We might want to warn the user when doing this. 
The docs are not very explicit about this. Opened an issue: https://github.com/scikit-learn/scikit-learn/issues/7135 Andy On 08/02/2016 05:02 PM, Shi Yu wrote: > Hello, > > We trained SVM models in scikit-learn 0.17 and saved it as pickle > files. When loading the models back in a lower version of scikit-learn > 0.15, the outputs are entirely different. Basically for binary > classification problem, for the same test data, it swapped the > probabilities and gave an opposite prediction. In 0.17 the > probability is [0.02668825, 0.97331175] and the prediction is 1. In > 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. > > I wonder is anyone seeing the same issue, or it has been notified. I > could provide more details for error replication if required. > > Best, > > Shi > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From shee.yu at gmail.com Wed Aug 3 15:02:46 2016 From: shee.yu at gmail.com (Shi Yu) Date: Wed, 3 Aug 2016 14:02:46 -0500 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: Hi Andy, Thanks for the feedback. Indeed, we think it would be a good idea to enforce version persistence, something like Java's serialVersionUID, here. We deployed models trained on our laptop onto our clusters, ran into this issue, and learned a serious lesson from it. Best, Shi ---------- Forwarded message ---------- From: Andreas Mueller Date: Wed, Aug 3, 2016 at 1:29 PM Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 To: Scikit-learn user and developer mailing list Hi Shi. In general, there is no guarantee that models built with one version will work in a different version. In particular, loading in an older version when built in a newer version seems something that's tricky to achieve. We might want to warn the user when doing this. The docs are not very explicit about this. Opened an issue: https://github.com/scikit-learn/scikit-learn/issues/7135 Andy On 08/02/2016 05:02 PM, Shi Yu wrote: Hello, We trained SVM models in scikit-learn 0.17 and saved it as pickle files. When loading the models back in a lower version of scikit-learn 0.15, the outputs are entirely different. Basically for binary classification problem, for the same test data, it swapped the probabilities and gave an opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] and the prediction is 1. In 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. I wonder is anyone seeing the same issue, or it has been notified. I could provide more details for error replication if required. Best, Shi _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL:
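A minimal sketch of that kind of version guard on the user side (an illustration only, not a scikit-learn API): store the training-time version next to the pickled estimator and warn when it differs at load time.

    import pickle
    import warnings

    import sklearn

    def dump_model(model, path):
        # Persist the estimator together with the scikit-learn version that produced it.
        with open(path, "wb") as f:
            pickle.dump({"sklearn_version": sklearn.__version__, "model": model}, f)

    def load_model(path):
        with open(path, "rb") as f:
            payload = pickle.load(f)
        if payload["sklearn_version"] != sklearn.__version__:
            warnings.warn("Model pickled with scikit-learn %s, but %s is installed; "
                          "results may differ." % (payload["sklearn_version"],
                                                   sklearn.__version__))
        return payload["model"]
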
From matthieu.brucher at gmail.com Wed Aug 3 15:16:23 2016 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 3 Aug 2016 20:16:23 +0100 Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: More often than not, forward compatibility is not possible. I don't think there are lots of companies doing so, as even backward compatibility is tricky to achieve. Even with serializing the version, if the previous version doesn't know about the additional data structures that have an impact on the model, you are screwed. I don't think there is anything you can expect for forward compatibility... Cheers, 2016-08-03 19:29 GMT+01:00 Andreas Mueller : > Hi Shi. > In general, there is no guarantee that models built with one version will > work in a different version. > In particular, loading in an older version when built in a newer version > seems something that's tricky to achieve. > > We might want to warn the user when doing this. The docs are not very > explicit about this. > > Opened an issue: > https://github.com/scikit-learn/scikit-learn/issues/7135 > > Andy > > > On 08/02/2016 05:02 PM, Shi Yu wrote: > > Hello, > > We trained SVM models in scikit-learn 0.17 and saved it as pickle files. > When loading the models back in a lower version of scikit-learn 0.15, the > outputs are entirely different. Basically for binary classification > problem, for the same test data, it swapped the probabilities and gave an > opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] > and the prediction is 1. In 0.15 the probability is [0.97331175, > 0.02668825] and the prediction is 0. > > I wonder is anyone seeing the same issue, or it has been notified. I > could provide more details for error replication if required. > > Best, > > Shi > > > _______________________________________________ > scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Information System Engineer, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Wed Aug 3 15:09:06 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 3 Aug 2016 19:09:06 +0000 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: Use conda or a virtualenv to handle compatibility issues. Then you can control when upgrades occur. I've used conda with good effect to handle version issues such as yours. Otherwise, use PMML. The Data Mining Group maintains a list of PMML producers and consumers. I think there is a Python wrapper for JPMML which is what you can use for a consumer.
http://dmg.org/pmml/products.html __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Shi Yu Sent: Wednesday, August 3, 2016 3:03 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 ? EXT MSG: Hi Andy, Thanks for the feedback. Indeed we think it would be a good idea to enforce version persistence something like in serialVersionUID Java here. We deployed models trained on our laptop onto our clusters, and ran into this issue and paid a serious lesson for that. Best, Shi ---------- Forwarded message ---------- From: Andreas Mueller > Date: Wed, Aug 3, 2016 at 1:29 PM Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 To: Scikit-learn user and developer mailing list > Hi Shi. In general, there is no guarantee that models built with one version will work in a different version. In particular, loading in an older version when built in a newer version seems something that's tricky to achieve. We might want to warn the user when doing this. The docs are not very explicit about this. Opened an issue: https://github.com/scikit-learn/scikit-learn/issues/7135 Andy On 08/02/2016 05:02 PM, Shi Yu wrote: Hello, We trained SVM models in scikit-learn 0.17 and saved it as pickle files. When loading the models back in a lower version of scikit-learn 0.15, the outputs are entirely different. Basically for binary classification problem, for the same test data, it swapped the probabilities and gave an opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] and the prediction is 1. In 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. I wonder is anyone seeing the same issue, or it has been notified. I could provide more details for error replication if required. Best, Shi _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 3 15:38:39 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Aug 2016 15:38:39 -0400 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: On 08/03/2016 03:16 PM, Matthieu Brucher wrote: > More often than not, forward compatiblity is not possible. I don't > think there are lots of companies doing so, as even backward > compatibility is tricky to achieve. > Even with serializing the version, if the previous version doesn't > know about the additional data structures that have an impact on the > model, you are screwed. I don't think there is anything you can expect > for forward compatibility... 
I think you can expect an error message instead of undefined behavior, though ;) From matthieu.brucher at gmail.com Wed Aug 3 16:13:14 2016 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 3 Aug 2016 21:13:14 +0100 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: True! 2016-08-03 20:38 GMT+01:00 Andreas Mueller : > > > On 08/03/2016 03:16 PM, Matthieu Brucher wrote: > >> More often than not, forward compatiblity is not possible. I don't think >> there are lots of companies doing so, as even backward compatibility is >> tricky to achieve. >> Even with serializing the version, if the previous version doesn't know >> about the additional data structures that have an impact on the model, you >> are screwed. I don't think there is anything you can expect for forward >> compatibility... >> > I think you can expect an error message instead of undefined behavior, > though ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Information System Engineer, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From lukejchang at gmail.com Wed Aug 3 18:47:35 2016 From: lukejchang at gmail.com (Luke Chang) Date: Wed, 3 Aug 2016 18:47:35 -0400 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: <4034A4A6-338F-44BA-A566-56EB91825845@gmail.com> 1pmish -luke > On Aug 3, 2016, at 4:13 PM, Matthieu Brucher wrote: > > True! > > 2016-08-03 20:38 GMT+01:00 Andreas Mueller : >> >> >>> On 08/03/2016 03:16 PM, Matthieu Brucher wrote: >>> More often than not, forward compatiblity is not possible. I don't think there are lots of companies doing so, as even backward compatibility is tricky to achieve. >>> Even with serializing the version, if the previous version doesn't know about the additional data structures that have an impact on the model, you are screwed. I don't think there is anything you can expect for forward compatibility... >> I think you can expect an error message instead of undefined behavior, though ;) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Information System Engineer, Ph.D. > Blog: http://blog.audio-tk.com/ > LinkedIn: http://www.linkedin.com/in/matthieubrucher > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Aug 4 00:25:55 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 4 Aug 2016 14:25:55 +1000 Subject: [scikit-learn] StackOverflow Documentation Message-ID: StackOverflow has introduced its Documentation space, where scikit-learn is a covered subject: http://stackoverflow.com/documentation/scikit-learn. The project is a little interesting, and otherwise somewhat exasperating/tiring, given the overlap with our own documentation efforts, which we would like to see continually improve and maintain alignment with the codebase. Currently there seem to be two contributors. 
One appears to have been copy-pasting official scikit-learn documentation, while the other has produced original material. From a license perspective, copy-pasted material might be okay with attribution and reference to a BSD licence, with the assumption that it is then double-licensed (BSD and CC-BY-SA) if copied from SO. But I assume that copying without attribution is actually plagiarism and should be reverted, while we should discourage copying with attribution: if SO Documentation for scikit-learn has its place, it should be different to the official reference...? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Thu Aug 4 01:13:25 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 4 Aug 2016 01:13:25 -0400 Subject: Re: [scikit-learn] StackOverflow Documentation In-Reply-To: References: Message-ID: Hm, that's an "interesting" approach by SO, I guess their idea is to build a collection of code-and-example based snippets for less well-documented libraries -- especially libraries that want to keep their documentation lean. > But I assume that copying without attribution is actually plagiarism and should be reverted, as far as I know, you are right regarding BSD. In this scikit-learn case, it seems more like that these users are merely "farming" for SO points and rep by reposting scikit-learn documentation. In my opinion, the polite way to go about it is to just comment as a scikit-learn dev saying that these reposts are okay under the BSD license but that an attribution to the original source needs to be added since it violates the copyright otherwise -- like you mentioned -- and adding a nice message encouraging these users to make suggestions and improvements to the original docs. (and if nothing changes after xx days, I would report it to SO). > On Aug 4, 2016, at 12:25 AM, Joel Nothman wrote: > > StackOverflow has introduced its Documentation space, where scikit-learn is a covered subject: http://stackoverflow.com/documentation/scikit-learn. The project is a little interesting, and otherwise somewhat exasperating/tiring, given the overlap with our own documentation efforts, which we would like to see continually improve and maintain alignment with the codebase. > > Currently there seem to be two contributors. One appears to have been copy-pasting official scikit-learn documentation, while the other has produced original material. From a license perspective, copy-pasted material might be okay with attribution and reference to a BSD licence, with the assumption that it is then double-licensed (BSD and CC-BY-SA) if copied from SO. > > But I assume that copying without attribution is actually plagiarism and should be reverted, while we should discourage copying with attribution: if SO Documentation for scikit-learn has its place, it should be different to the official reference...? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Thu Aug 4 02:29:26 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 4 Aug 2016 08:29:26 +0200 Subject: [scikit-learn] StackOverflow Documentation In-Reply-To: References: Message-ID: <20160804062926.GB2146765@phare.normalesup.org> > In this scikit-learn case, it seems more like that these users are merely "farming" for SO points and rep by reposting scikit-learn documentation.
In my opinion, the polite way to go about it is to just comment as a scikit-learn dev saying that these reposts are okay under the BSD license but that an attribution to the original source needs to be added since it violates the copyright otherwise -- like you mentioned -- and adding a nice message encouraging these users to make suggestions and improvements to the original docs. (and if nothing changes after xx days, I would report it to SO). +1 From dmoisset at machinalis.com Thu Aug 4 07:40:37 2016 From: dmoisset at machinalis.com (Daniel Moisset) Date: Thu, 4 Aug 2016 12:40:37 +0100 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> <20160802174822.GD1269350@phare.normalesup.org> Message-ID: If the dependency is really a showstopper, bundling could be an option. The module is a single, pure python file so that shouldn't complicate things much. @Joel, regarding "without ndarray/sparse matrix type support, we're not going to be able to annotate most of our code in sufficient detail": That shouldn't be a problem, we have already written some working support for numpy at https://github.com/machinalis/mypy-data, so it's possible to annotate ndarrays and matrix types (scipy.sparse is not covered yet, I could take a look into that). Best, D. On Tue, Aug 2, 2016 at 7:12 PM, Andreas Mueller wrote: > > > On 08/02/2016 01:48 PM, Gael Varoquaux wrote: > >> * One relevant consequence is that, to add annotations on the code, >>> scikit-learn should depend on the "typing"[1] module which contains some >>> of the >>> basic names imported and used in annotations. It's a stdlib module in >>> python >>> 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not >>> sure >>> how it works with Python 2.6, which might be an issue) >>> >> I am afraid that this is going to be a problem: we have a no dependency >> policy (beyond numpy and scipy). >> > I still think this is a point we should discuss further ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Daniel F. Moisset - UK Country Manager www.machinalis.com Skype: @dmoisset -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Thu Aug 4 08:11:50 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Thu, 04 Aug 2016 12:11:50 +0000 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> <20160802174822.GD1269350@phare.normalesup.org> Message-ID: Another point about the dependency: the dependency is not required at run time - it is only required to run the type checker. You could easily put it in a try/catch block and people running scikit-learn wouldn't need it. On Thu, 4 Aug 2016 at 13:41 Daniel Moisset wrote: > If the dependency is really a showstopper, bundling could be an option. > The module is a single, pure python file so that shouldn't complicate > things much.
> > @Joel, regarding > ?without ndarray/sparse matrix type support, we're not going to be able > to annotate most of our code in sufficient detail? > > That shouldn't be a problem, we have already written some working support > for numpy at https://github.com/machinalis/mypy-data, so it's possible no > annotate ndarrays and matrix types (scipy.sparse is not covered yet, I > could take a look into that). > > Best, > D. > > On Tue, Aug 2, 2016 at 7:12 PM, Andreas Mueller wrote: > >> >> >> On 08/02/2016 01:48 PM, Gael Varoquaux wrote: >> >>> * One relevant consequence is that, to add annotations on the code, >>>> scikit-learn should depend on the "typing"[1] module which contains >>>> some of the >>>> basic names imported and used in annotations. It's a stdlib module in >>>> python >>>> 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not >>>> sure >>>> how it works with Python 2.6, which might be an issue) >>>> >>> I am afraid that this is going to be a problem: we have a no dependency >>> policy (beyond numpy and scipy). >>> >> I still think this is a point we should discuss further ;) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > Daniel F. Moisset - UK Country Manager > www.machinalis.com > Skype: @dmoisset > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Thu Aug 4 08:20:48 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 4 Aug 2016 12:20:48 +0000 Subject: [scikit-learn] StackOverflow Documentation In-Reply-To: <20160804062926.GB2146765@phare.normalesup.org> References: <20160804062926.GB2146765@phare.normalesup.org> Message-ID: Perhaps a comment to that effect at StackOverflow Documentation would be helpful. I support the SO effort. I think it provides an opportunity to introduce examples and tips that aren't in the tutorials or user's guide. However, my own position is I would like to contribute to the official sklearn site - if I could get clearance on the legal side. Sigh. __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning |?5985 State Bridge Road, Johns Creek, GA 30097?|?dale.t.smith at macys.com -----Original Message----- From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Gael Varoquaux Sent: Thursday, August 4, 2016 2:29 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] StackOverflow Documentation ? EXT MSG: > In this scikit-learn case, it seems more like that these users are merely ?farming? for SO points and rep by reposting scikit-learn documentation. In my opinion, the polite way to go about it is to just comment as a scikit-learn dev saying that these reposts are okay under the BSD license but that a contribution to the original source needs to be added since it violates the copyright otherwise ? like you mentioned ? and adding a nice message encouraging these users to make suggestions and improvements to the original docs. (and if nothing changes after xx days, I would report it to SO). 
+1 _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. From basilbeirouti at gmail.com Thu Aug 4 13:07:17 2016 From: basilbeirouti at gmail.com (Basil Beirouti) Date: Thu, 4 Aug 2016 12:07:17 -0500 Subject: [scikit-learn] BM25 Pull Request Message-ID: Hi all, Just sending an email for visibility. I've made a pull request to add Bm25 capabilities to complement TFIDF in feature_extraction.text. All tests pass. Sincerely, Basil Beirouti -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Thu Aug 4 17:17:29 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Thu, 4 Aug 2016 14:17:29 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series Message-ID: Hi, I am currently exploring the problem of speed bump detection using accelerometer time series data. I have extracted some features based on mean, std deviation etc within a time window. Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) I was looking into One ClassSVM covariance.EllipticEnvelope sklearn.ensemble.IsolationForest but I am not sure how to use them. What I get from docs separate the positive examples and train using only negative examples clf.fit(X_train) and then predict the positive examples using clf.predict(X_test) I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. Can we do something like Cross validation to learn the parameters as in normal binary SVM classification Thanks,? Amita Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 4 19:43:03 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 4 Aug 2016 19:43:03 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Hi, Yes you can use your labeled data (you will need to sub-sample your normal class to have similar proportion normal-abnormal) to learn your hyper-parameters through CV. You can also try to use supervised classification algorithms on `not too highly unbalanced' sub-samples. Nicolas On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > Hi, > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. 
> > What I get from docs > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > Thanks,? > Amita > > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Thu Aug 4 19:48:54 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Thu, 4 Aug 2016 16:48:54 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: SubSample would remove a lot of information from the negative class. I have more than 500 samples of negative class and just 5 samples of positive class. Amita On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix wrote: > Hi, > > Yes you can use your labeled data (you will need to sub-sample your normal > class to have similar proportion normal-abnormal) to learn your > hyper-parameters through CV. > > You can also try to use supervised classification algorithms on `not too > highly unbalanced' sub-samples. > > Nicolas > > On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > >> Hi, >> >> I am currently exploring the problem of speed bump detection using >> accelerometer time series data. >> I have extracted some features based on mean, std deviation etc within a >> time window. >> >> Since the dataset is highly skewed ( I have just 5 positive samples for >> every > 300 samples) >> I was looking into >> >> One ClassSVM >> covariance.EllipticEnvelope >> sklearn.ensemble.IsolationForest >> >> but I am not sure how to use them. >> >> What I get from docs >> separate the positive examples and train using only negative examples >> >> clf.fit(X_train) >> >> and then >> predict the positive examples using >> clf.predict(X_test) >> >> >> I am not sure what is then the role of positive examples in my training >> dataset or how can I use them to improve my classifier so that I can >> predict better on new samples. >> >> >> Can we do something like Cross validation to learn the parameters as in >> normal binary SVM classification >> >> Thanks,? 
>> Amita >> >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> >> >> >> -- >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 4 20:23:28 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 4 Aug 2016 20:23:28 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: You can evaluate the accuracy of your hyper-parameters on a few samples. Just don't use the accuracy as your performance measure. For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies seems indeed to be very little. Nicolas On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > SubSample would remove a lot of information from the negative class. > I have more than 500 samples of negative class and just 5 samples of > positive class. > > Amita > > On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: > >> Hi, >> >> Yes you can use your labeled data (you will need to sub-sample your >> normal class to have similar proportion normal-abnormal) to learn your >> hyper-parameters through CV. >> >> You can also try to use supervised classification algorithms on `not too >> highly unbalanced' sub-samples. >> >> Nicolas >> >> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: >> >>> Hi, >>> >>> I am currently exploring the problem of speed bump detection using >>> accelerometer time series data. >>> I have extracted some features based on mean, std deviation etc within >>> a time window. >>> >>> Since the dataset is highly skewed ( I have just 5 positive samples for >>> every > 300 samples) >>> I was looking into >>> >>> One ClassSVM >>> covariance.EllipticEnvelope >>> sklearn.ensemble.IsolationForest >>> >>> but I am not sure how to use them. >>> >>> What I get from docs >>> separate the positive examples and train using only negative examples >>> >>> clf.fit(X_train) >>> >>> and then >>> predict the positive examples using >>> clf.predict(X_test) >>> >>> >>> I am not sure what is then the role of positive examples in my training >>> dataset or how can I use them to improve my classifier so that I can >>> predict better on new samples. >>> >>> >>> Can we do something like Cross validation to learn the parameters as in >>> normal binary SVM classification >>> >>> Thanks,? 
>>> Amita >>> >>> Amita Misra >>> Graduate Student Researcher >>> Natural Language and Dialogue Systems Lab >>> Baskin School of Engineering >>> University of California Santa Cruz >>> >>> >>> >>> >>> >>> -- >>> Amita Misra >>> Graduate Student Researcher >>> Natural Language and Dialogue Systems Lab >>> Baskin School of Engineering >>> University of California Santa Cruz >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Thu Aug 4 20:42:25 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Thu, 4 Aug 2016 17:42:25 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data? I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive. However, I think that unseen new data would be quite similar to what I have in training data hence if I can correctly learn a classifier for these 5, I hope it should work well for unseen speed bumps. Thanks, Amita On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix wrote: > You can evaluate the accuracy of your hyper-parameters on a few samples. > Just don't use the accuracy as your performance measure. > > For supervised classification, training multiple algorithms on small > balanced subsamples usually works well, but 5 anomalies seems indeed to be > very little. > > Nicolas > > On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > >> SubSample would remove a lot of information from the negative class. >> I have more than 500 samples of negative class and just 5 samples of >> positive class. >> >> Amita >> >> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix >> wrote: >> >>> Hi, >>> >>> Yes you can use your labeled data (you will need to sub-sample your >>> normal class to have similar proportion normal-abnormal) to learn your >>> hyper-parameters through CV. >>> >>> You can also try to use supervised classification algorithms on `not too >>> highly unbalanced' sub-samples. >>> >>> Nicolas >>> >>> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: >>> >>>> Hi, >>>> >>>> I am currently exploring the problem of speed bump detection using >>>> accelerometer time series data. >>>> I have extracted some features based on mean, std deviation etc within >>>> a time window. >>>> >>>> Since the dataset is highly skewed ( I have just 5 positive samples >>>> for every > 300 samples) >>>> I was looking into >>>> >>>> One ClassSVM >>>> covariance.EllipticEnvelope >>>> sklearn.ensemble.IsolationForest >>>> >>>> but I am not sure how to use them. 
>>>> >>>> What I get from docs >>>> separate the positive examples and train using only negative examples >>>> >>>> clf.fit(X_train) >>>> >>>> and then >>>> predict the positive examples using >>>> clf.predict(X_test) >>>> >>>> >>>> I am not sure what is then the role of positive examples in my training >>>> dataset or how can I use them to improve my classifier so that I can >>>> predict better on new samples. >>>> >>>> >>>> Can we do something like Cross validation to learn the parameters as in >>>> normal binary SVM classification >>>> >>>> Thanks,? >>>> Amita >>>> >>>> Amita Misra >>>> Graduate Student Researcher >>>> Natural Language and Dialogue Systems Lab >>>> Baskin School of Engineering >>>> University of California Santa Cruz >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Amita Misra >>>> Graduate Student Researcher >>>> Natural Language and Dialogue Systems Lab >>>> Baskin School of Engineering >>>> University of California Santa Cruz >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 4 21:12:40 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 4 Aug 2016 21:12:40 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: There are different ways of aggregating estimators. A possibility can be to take the majority vote, or averaging decision functions. On Aug 4, 2016 8:44 PM, "Amita Misra" wrote: > If I train multiple algorithms on different subsamples, then how do I get > the final classifier that predicts unseen data? > > > I have very few positive samples since it is speed bump detection and we > have very few speed bumps in a drive. > However, I think that unseen new data would be quite similar to what I > have in training data hence if I can correctly learn a classifier for these > 5, I hope it should work well for unseen speed bumps. > > Thanks, > Amita > > On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: > >> You can evaluate the accuracy of your hyper-parameters on a few samples. >> Just don't use the accuracy as your performance measure. >> >> For supervised classification, training multiple algorithms on small >> balanced subsamples usually works well, but 5 anomalies seems indeed to be >> very little. >> >> Nicolas >> >> On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: >> >>> SubSample would remove a lot of information from the negative class. 
>>> I have more than 500 samples of negative class and just 5 samples of >>> positive class. >>> >>> Amita >>> >>> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix >>> wrote: >>> >>>> Hi, >>>> >>>> Yes you can use your labeled data (you will need to sub-sample your >>>> normal class to have similar proportion normal-abnormal) to learn your >>>> hyper-parameters through CV. >>>> >>>> You can also try to use supervised classification algorithms on `not >>>> too highly unbalanced' sub-samples. >>>> >>>> Nicolas >>>> >>>> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: >>>> >>>>> Hi, >>>>> >>>>> I am currently exploring the problem of speed bump detection using >>>>> accelerometer time series data. >>>>> I have extracted some features based on mean, std deviation etc >>>>> within a time window. >>>>> >>>>> Since the dataset is highly skewed ( I have just 5 positive samples >>>>> for every > 300 samples) >>>>> I was looking into >>>>> >>>>> One ClassSVM >>>>> covariance.EllipticEnvelope >>>>> sklearn.ensemble.IsolationForest >>>>> >>>>> but I am not sure how to use them. >>>>> >>>>> What I get from docs >>>>> separate the positive examples and train using only negative examples >>>>> >>>>> clf.fit(X_train) >>>>> >>>>> and then >>>>> predict the positive examples using >>>>> clf.predict(X_test) >>>>> >>>>> >>>>> I am not sure what is then the role of positive examples in my >>>>> training dataset or how can I use them to improve my classifier so that I >>>>> can predict better on new samples. >>>>> >>>>> >>>>> Can we do something like Cross validation to learn the parameters as >>>>> in normal binary SVM classification >>>>> >>>>> Thanks,? >>>>> Amita >>>>> >>>>> Amita Misra >>>>> Graduate Student Researcher >>>>> Natural Language and Dialogue Systems Lab >>>>> Baskin School of Engineering >>>>> University of California Santa Cruz >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Amita Misra >>>>> Graduate Student Researcher >>>>> Natural Language and Dialogue Systems Lab >>>>> Baskin School of Engineering >>>>> University of California Santa Cruz >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Amita Misra >>> Graduate Student Researcher >>> Natural Language and Dialogue Systems Lab >>> Baskin School of Engineering >>> University of California Santa Cruz >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
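The two approaches discussed in this thread can be sketched in a few lines. This is only an illustration, not code from the thread: OneClassSVM is one of the estimators Amita lists, the SVC-on-balanced-subsamples ensemble is one possible reading of Nicolas's suggestion, and X_train, X_test, y_train, X_normal and all estimator parameters (nu, gamma, the number and size of subsamples) are placeholders that would need tuning by cross-validation.

    import numpy as np
    from sklearn.svm import OneClassSVM, SVC

    # (a) Novelty detection: fit on normal windows only, then flag anomalies.
    oc = OneClassSVM(nu=0.05, kernel='rbf', gamma='auto')
    oc.fit(X_normal)                        # feature windows with no speed bump
    labels = oc.predict(X_test)             # +1 = normal, -1 = candidate bump
    scores = oc.decision_function(X_test)   # lower = more anomalous

    # (b) Supervised ensemble on balanced subsamples: keep the few positives in
    # every subsample, draw a fresh small set of negatives each time, then
    # aggregate by averaging decision functions (majority vote also works).
    rng = np.random.RandomState(0)
    pos = np.where(y_train == 1)[0]
    neg = np.where(y_train == 0)[0]
    clfs = []
    for _ in range(25):
        sub = rng.choice(neg, size=5 * len(pos), replace=False)
        idx = np.concatenate([pos, sub])
        clfs.append(SVC(kernel='rbf', gamma='auto').fit(X_train[idx], y_train[idx]))
    avg_score = np.mean([c.decision_function(X_test) for c in clfs], axis=0)
    y_pred = (avg_score > 0).astype(int)

The threshold on the averaged score (0 here) can itself be chosen on held-out data rather than taken at face value.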
URL: From Dale.T.Smith at macys.com Fri Aug 5 08:26:01 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Fri, 5 Aug 2016 12:26:01 +0000 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: I don?t think you should treat this as an outlier detection problem. Why not try it as a classification problem? The dataset is highly unbalanced. Try http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html Use sample_weight to tell the fit method about the class imbalance. But be sure to read up about unbalanced classification and the class_weight parameter to ExtraTreesClassifier. You cannot use the accuracy to find the best model, so read up on model validation in the sklearn User?s Guide. And when you do cross-validation to get the best hyperparameters, be sure you pass the sample weights as well. Time series data is a bit different to use with cross-validation. You may want to add features such as minutes since midnight, day of week, weekday/weekend. And make sure your cross-validation folds respect the time series nature of the problem. http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Nicolas Goix Sent: Thursday, August 4, 2016 9:13 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Supervised anomaly detection in time series ? EXT MSG: There are different ways of aggregating estimators. A possibility can be to take the majority vote, or averaging decision functions. On Aug 4, 2016 8:44 PM, "Amita Misra" > wrote: If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data? I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive. However, I think that unseen new data would be quite similar to what I have in training data hence if I can correctly learn a classifier for these 5, I hope it should work well for unseen speed bumps. Thanks, Amita On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: You can evaluate the accuracy of your hyper-parameters on a few samples. Just don't use the accuracy as your performance measure. For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies seems indeed to be very little. Nicolas On Aug 4, 2016 7:51 PM, "Amita Misra" > wrote: SubSample would remove a lot of information from the negative class. I have more than 500 samples of negative class and just 5 samples of positive class. Amita On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: Hi, Yes you can use your labeled data (you will need to sub-sample your normal class to have similar proportion normal-abnormal) to learn your hyper-parameters through CV. You can also try to use supervised classification algorithms on `not too highly unbalanced' sub-samples. Nicolas On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra > wrote: Hi, I am currently exploring the problem of speed bump detection using accelerometer time series data. 
I have extracted some features based on mean, std deviation etc within a time window. Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) I was looking into One ClassSVM covariance.EllipticEnvelope sklearn.ensemble.IsolationForest but I am not sure how to use them. What I get from docs separate the positive examples and train using only negative examples clf.fit(X_train) and then predict the positive examples using clf.predict(X_test) I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. Can we do something like Cross validation to learn the parameters as in normal binary SVM classification Thanks,? Amita Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedropazzini at gmail.com Fri Aug 5 09:32:52 2016 From: pedropazzini at gmail.com (Pedro Pazzini) Date: Fri, 5 Aug 2016 10:32:52 -0300 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Just to add a few things to the discussion: 1. For unbalanced problems, as far as I know, one of the best scores to evaluate a classifier is the Area Under the ROC curve: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html. For that you will have to use clf.predict_proba(X_test) instead of clf.predict(X_test). I think that using the 'sample_weight' parameter as Smith said is a promising choice. 2. Usually is recommend the normalization of each time series for comparing them. The Z-score normalization is one of the most used [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf]. 3. There are some interesting dissimilarity measures such as DTW (Dynamic Time Warping), CID (Complex Invariant Distance), and others for comparing time series[Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. 
And there are also other approaches for comparing time series in the frequency domain such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf ]. I hope it helps. 2016-08-05 9:26 GMT-03:00 Dale T Smith : > I don?t think you should treat this as an outlier detection problem. Why > not try it as a classification problem? The dataset is highly unbalanced. > Try > > > > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble. > ExtraTreesClassifier.html > > > > Use sample_weight to tell the fit method about the class imbalance. But be > sure to read up about unbalanced classification and the class_weight > parameter to ExtraTreesClassifier. You cannot use the accuracy to find the > best model, so read up on model validation in the sklearn User?s Guide. And > when you do cross-validation to get the best hyperparameters, be sure you > pass the sample weights as well. > > > > Time series data is a bit different to use with cross-validation. You may > want to add features such as minutes since midnight, day of week, > weekday/weekend. And make sure your cross-validation folds respect the time > series nature of the problem. > > > > http://stackoverflow.com/questions/37583263/scikit- > learn-cross-validation-custom-splits-for-time-series-data > > > > > > ____________________________________________________________ > ______________________________ > *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data > Science and Capacity Planning > | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *Nicolas Goix > *Sent:* Thursday, August 4, 2016 9:13 PM > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series > > > > ? EXT MSG: > > There are different ways of aggregating estimators. A possibility can be > to take the majority vote, or averaging decision functions. > > > > On Aug 4, 2016 8:44 PM, "Amita Misra" wrote: > > If I train multiple algorithms on different subsamples, then how do I get > the final classifier that predicts unseen data? > > I have very few positive samples since it is speed bump detection and we > have very few speed bumps in a drive. > However, I think that unseen new data would be quite similar to what I > have in training data hence if I can correctly learn a classifier for these > 5, I hope it should work well for unseen speed bumps. > > Thanks, > Amita > > > > On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: > > You can evaluate the accuracy of your hyper-parameters on a few samples. > Just don't use the accuracy as your performance measure. > > For supervised classification, training multiple algorithms on small > balanced subsamples usually works well, but 5 anomalies seems indeed to be > very little. > > Nicolas > > > > On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > > SubSample would remove a lot of information from the negative class. > > I have more than 500 samples of negative class and just 5 samples of > positive class. > > Amita > > > > On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: > > Hi, > > > > Yes you can use your labeled data (you will need to sub-sample your normal > class to have similar proportion normal-abnormal) to learn your > hyper-parameters through CV. 
> > > > You can also try to use supervised classification algorithms on `not too > highly unbalanced' sub-samples. > > > > Nicolas > > > > On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > > Hi, > > > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. > > What I get from docs > > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > > > Thanks,? > > Amita > > > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
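As a concrete illustration of the scoring advice above, the sketch below combines the earlier ExtraTreesClassifier suggestion with an AUC computed on predicted probabilities and a per-window z-score normalization. It is only a sketch: X_raw_train and X_raw_test are assumed to hold one raw accelerometer window per row (z-scoring applies to the raw series, not to already-extracted summary features), the labels are assumed to be 0/1, and n_estimators and the class_weight setting are illustrative.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import roc_auc_score

    def zscore_rows(X):
        # Normalize each window so offsets and scales do not dominate.
        mu = X.mean(axis=1, keepdims=True)
        sd = X.std(axis=1, keepdims=True) + 1e-8
        return (X - mu) / sd

    Xtr, Xte = zscore_rows(X_raw_train), zscore_rows(X_raw_test)

    clf = ExtraTreesClassifier(n_estimators=200, class_weight='balanced',
                               random_state=0)
    clf.fit(Xtr, y_train)

    # Rank-based evaluation: score probabilities, not hard 0/1 predictions.
    proba = clf.predict_proba(Xte)[:, 1]
    print(roc_auc_score(y_test, proba))

With only a handful of positives the test AUC will be noisy, so it is worth averaging it over several cross-validation splits rather than reporting a single hold-out number.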
URL: From Dale.T.Smith at macys.com Fri Aug 5 10:09:11 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Fri, 5 Aug 2016 14:09:11 +0000 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: To analyze unbalanced classifiers, use from sklearn.metrics import classification_report __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Pedro Pazzini Sent: Friday, August 5, 2016 9:33 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Supervised anomaly detection in time series ? EXT MSG: Just to add a few things to the discussion: 1. For unbalanced problems, as far as I know, one of the best scores to evaluate a classifier is the Area Under the ROC curve: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html. For that you will have to use clf.predict_proba(X_test) instead of clf.predict(X_test). I think that using the 'sample_weight' parameter as Smith said is a promising choice. 2. Usually is recommend the normalization of each time series for comparing them. The Z-score normalization is one of the most used [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf]. 3. There are some interesting dissimilarity measures such as DTW (Dynamic Time Warping), CID (Complex Invariant Distance), and others for comparing time series[Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. And there are also other approaches for comparing time series in the frequency domain such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf]. I hope it helps. 2016-08-05 9:26 GMT-03:00 Dale T Smith >: I don?t think you should treat this as an outlier detection problem. Why not try it as a classification problem? The dataset is highly unbalanced. Try http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html Use sample_weight to tell the fit method about the class imbalance. But be sure to read up about unbalanced classification and the class_weight parameter to ExtraTreesClassifier. You cannot use the accuracy to find the best model, so read up on model validation in the sklearn User?s Guide. And when you do cross-validation to get the best hyperparameters, be sure you pass the sample weights as well. Time series data is a bit different to use with cross-validation. You may want to add features such as minutes since midnight, day of week, weekday/weekend. And make sure your cross-validation folds respect the time series nature of the problem. http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Nicolas Goix Sent: Thursday, August 4, 2016 9:13 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Supervised anomaly detection in time series ? 
EXT MSG: There are different ways of aggregating estimators. A possibility can be to take the majority vote, or averaging decision functions. On Aug 4, 2016 8:44 PM, "Amita Misra" > wrote: If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data? I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive. However, I think that unseen new data would be quite similar to what I have in training data hence if I can correctly learn a classifier for these 5, I hope it should work well for unseen speed bumps. Thanks, Amita On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: You can evaluate the accuracy of your hyper-parameters on a few samples. Just don't use the accuracy as your performance measure. For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies seems indeed to be very little. Nicolas On Aug 4, 2016 7:51 PM, "Amita Misra" > wrote: SubSample would remove a lot of information from the negative class. I have more than 500 samples of negative class and just 5 samples of positive class. Amita On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: Hi, Yes you can use your labeled data (you will need to sub-sample your normal class to have similar proportion normal-abnormal) to learn your hyper-parameters through CV. You can also try to use supervised classification algorithms on `not too highly unbalanced' sub-samples. Nicolas On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra > wrote: Hi, I am currently exploring the problem of speed bump detection using accelerometer time series data. I have extracted some features based on mean, std deviation etc within a time window. Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) I was looking into One ClassSVM covariance.EllipticEnvelope sklearn.ensemble.IsolationForest but I am not sure how to use them. What I get from docs separate the positive examples and train using only negative examples clf.fit(X_train) and then predict the positive examples using clf.predict(X_test) I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. Can we do something like Cross validation to learn the parameters as in normal binary SVM classification Thanks,? 
Amita Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From qingkai.kong at gmail.com Fri Aug 5 14:05:27 2016 From: qingkai.kong at gmail.com (Qingkai Kong) Date: Fri, 5 Aug 2016 11:05:27 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: I also worked on something similar, instead of using some algorithms deal with unbalanced data, you can also try to create a balanced dataset either using oversampling or downsampling. scikit-learn-contrib has already had a project dealing with unbalanced data: https://github.com/scikit-learn-contrib/imbalanced-learn. Either you treat it as a classification problem or anomaly detection problem (I prefer to treat it as a classification problem first) you all need to find a better set of features in time domain or frequency domain. On Fri, Aug 5, 2016 at 7:09 AM, Dale T Smith wrote: > To analyze unbalanced classifiers, use > > > > from sklearn.metrics import classification_report > > > > > > ____________________________________________________________ > ______________________________ > *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data > Science and Capacity Planning > | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *Pedro Pazzini > *Sent:* Friday, August 5, 2016 9:33 AM > > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series > > > > ? EXT MSG: > > Just to add a few things to the discussion: > > 1. 
For unbalanced problems, as far as I know, one of the best scores > to evaluate a classifier is the Area Under the ROC curve: > http://scikit-learn.org/stable/modules/generated/ > sklearn.metrics.roc_auc_score.html > . > For that you will have to use clf.predict_proba(X_test) instead of > clf.predict(X_test). I think that using the 'sample_weight' parameter as > Smith said is a promising choice. > 2. Usually is recommend the normalization of each time series for > comparing them. The Z-score normalization is one of the most used [Ref: > http://wan.poly.edu/KDD2012/docs/p262.pdf > ]. > 3. There are some interesting dissimilarity measures such as DTW > (Dynamic Time Warping), CID (Complex Invariant Distance), and others for > comparing time series[Ref: https://www.icmc.usp.br/~ > gbatista/files/bracis2013_1.pdf > ]. And there > are also other approaches for comparing time series in the frequency domain > such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time% > 20Series/Efficient%20Similarity%20Search%20In% > 20Sequence%20Databases.pdf > > ]. > > I hope it helps. > > > > 2016-08-05 9:26 GMT-03:00 Dale T Smith : > > I don?t think you should treat this as an outlier detection problem. Why > not try it as a classification problem? The dataset is highly unbalanced. > Try > > > > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble. > ExtraTreesClassifier.html > > > > Use sample_weight to tell the fit method about the class imbalance. But be > sure to read up about unbalanced classification and the class_weight > parameter to ExtraTreesClassifier. You cannot use the accuracy to find the > best model, so read up on model validation in the sklearn User?s Guide. And > when you do cross-validation to get the best hyperparameters, be sure you > pass the sample weights as well. > > > > Time series data is a bit different to use with cross-validation. You may > want to add features such as minutes since midnight, day of week, > weekday/weekend. And make sure your cross-validation folds respect the time > series nature of the problem. > > > > http://stackoverflow.com/questions/37583263/scikit- > learn-cross-validation-custom-splits-for-time-series-data > > > > > > ____________________________________________________________ > ______________________________ > *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data > Science and Capacity Planning > | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *Nicolas Goix > *Sent:* Thursday, August 4, 2016 9:13 PM > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series > > > > ? EXT MSG: > > There are different ways of aggregating estimators. A possibility can be > to take the majority vote, or averaging decision functions. > > > > On Aug 4, 2016 8:44 PM, "Amita Misra" wrote: > > If I train multiple algorithms on different subsamples, then how do I get > the final classifier that predicts unseen data? > > I have very few positive samples since it is speed bump detection and we > have very few speed bumps in a drive. > However, I think that unseen new data would be quite similar to what I > have in training data hence if I can correctly learn a classifier for these > 5, I hope it should work well for unseen speed bumps. 
> > Thanks, > Amita > > > > On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: > > You can evaluate the accuracy of your hyper-parameters on a few samples. > Just don't use the accuracy as your performance measure. > > For supervised classification, training multiple algorithms on small > balanced subsamples usually works well, but 5 anomalies seems indeed to be > very little. > > Nicolas > > > > On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > > SubSample would remove a lot of information from the negative class. > > I have more than 500 samples of negative class and just 5 samples of > positive class. > > Amita > > > > On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: > > Hi, > > > > Yes you can use your labeled data (you will need to sub-sample your normal > class to have similar proportion normal-abnormal) to learn your > hyper-parameters through CV. > > > > You can also try to use supervised classification algorithms on `not too > highly unbalanced' sub-samples. > > > > Nicolas > > > > On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > > Hi, > > > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. > > What I get from docs > > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > > > Thanks,? 
> > Amita > > > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Qingkai KONG Ph.D Candidate Seismological Lab 289 McCone Hall University of California, Berkeley http://seismo.berkeley.edu/qingkaikong -------------- next part -------------- An HTML attachment was scrubbed... URL: From jgabor.astro at gmail.com Fri Aug 5 14:55:30 2016 From: jgabor.astro at gmail.com (Jared Gabor) Date: Fri, 5 Aug 2016 11:55:30 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Lots of great suggestions on how to model your problem. But this might be the kind of problem where you seriously ask how hard it would be to gather more data. On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > Hi, > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. 
> > What I get from docs > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > Thanks,? > Amita > > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Fri Aug 5 15:07:54 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Fri, 5 Aug 2016 12:07:54 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Thanks everyone for the suggestions. Actually we thought of gathering more data but the point is we do not have many speed bumps in our driving area. If we drive over the same speed bump again and again it may not add anything really novel to the data. I think a combination of oversampling and sample_weight along with ROC may be a good start for me. Thanks, Amita On Fri, Aug 5, 2016 at 11:55 AM, Jared Gabor wrote: > Lots of great suggestions on how to model your problem. But this might be > the kind of problem where you seriously ask how hard it would be to gather > more data. > > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > >> Hi, >> >> I am currently exploring the problem of speed bump detection using >> accelerometer time series data. >> I have extracted some features based on mean, std deviation etc within a >> time window. >> >> Since the dataset is highly skewed ( I have just 5 positive samples for >> every > 300 samples) >> I was looking into >> >> One ClassSVM >> covariance.EllipticEnvelope >> sklearn.ensemble.IsolationForest >> >> but I am not sure how to use them. >> >> What I get from docs >> separate the positive examples and train using only negative examples >> >> clf.fit(X_train) >> >> and then >> predict the positive examples using >> clf.predict(X_test) >> >> >> I am not sure what is then the role of positive examples in my training >> dataset or how can I use them to improve my classifier so that I can >> predict better on new samples. >> >> >> Can we do something like Cross validation to learn the parameters as in >> normal binary SVM classification >> >> Thanks,? 
>> Amita >> >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> >> >> >> -- >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Fri Aug 5 15:26:59 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Fri, 5 Aug 2016 15:26:59 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> > But this might be the kind of problem where you seriously ask how hard it would be to gather more data. Yeah, I agree, but this scenario is then typical in a sense of that it is an anomaly detection problem rather than a classification problem. I.e., you don?t have enough positive labels to fit the model and thus you need to do unsupervised learning to learn from the negative class only. Sure, supervised learning could work well, but I would also explore unsupervised learning here and see how that works for you; maybe one-class SVM as suggested or EM algorithm based mixture models (http://scikit-learn.org/stable/modules/mixture.html) Best, Sebastian > On Aug 5, 2016, at 2:55 PM, Jared Gabor wrote: > > Lots of great suggestions on how to model your problem. But this might be the kind of problem where you seriously ask how hard it would be to gather more data. > > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > Hi, > > I am currently exploring the problem of speed bump detection using accelerometer time series data. > I have extracted some features based on mean, std deviation etc within a time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > but I am not sure how to use them. > > What I get from docs > > separate the positive examples and train using only negative examples > clf.fit(X_train) > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. > > > Can we do something like Cross validation to learn the parameters as in normal binary SVM classification > > Thanks,? 
> Amita > > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From albertthomas88 at gmail.com Fri Aug 5 19:40:30 2016 From: albertthomas88 at gmail.com (Albert Thomas) Date: Fri, 05 Aug 2016 23:40:30 +0000 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> References: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> Message-ID: Hi, About your question on how to learn the parameters of anomaly detection algorithms using only the negative samples in your case, Nicolas and I worked on this aspect recently. If you are interested you can have look at: - Learning hyperparameters for unsupervised anomaly detection: https://drive.google.com/file/d/0B8Dg3PBX90KNUTg5NGNOVnFPX0hDNmJsSTcybzZMSHNPYkd3/view - How to evaluate the quality of unsupervised anomaly Detection algorithms?: https://drive.google.com/file/d/0B8Dg3PBX90KNenV3WjRkR09Bakx5YlNyMF9BUXVNem1hb0NR/view Best, Albert On Fri, Aug 5, 2016 at 9:34 PM Sebastian Raschka wrote: > > But this might be the kind of problem where you seriously ask how hard > it would be to gather more data. > > > Yeah, I agree, but this scenario is then typical in a sense of that it is > an anomaly detection problem rather than a classification problem. I.e., > you don?t have enough positive labels to fit the model and thus you need to > do unsupervised learning to learn from the negative class only. > > Sure, supervised learning could work well, but I would also explore > unsupervised learning here and see how that works for you; maybe one-class > SVM as suggested or EM algorithm based mixture models ( > http://scikit-learn.org/stable/modules/mixture.html) > > Best, > Sebastian > > > On Aug 5, 2016, at 2:55 PM, Jared Gabor wrote: > > > > Lots of great suggestions on how to model your problem. But this might > be the kind of problem where you seriously ask how hard it would be to > gather more data. > > > > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > > Hi, > > > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > > I have extracted some features based on mean, std deviation etc within > a time window. > > > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > > I was looking into > > > > One ClassSVM > > covariance.EllipticEnvelope > > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. > > > > What I get from docs > > > > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > > predict the positive examples using > > clf.predict(X_test) > > > > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. 
> > > > > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > > > Thanks,? > > Amita > > > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > > > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ionescu.vlad1 at gmail.com Sun Aug 7 04:42:03 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Sun, 07 Aug 2016 08:42:03 +0000 Subject: [scikit-learn] Scaling model selection on a cluster Message-ID: Hello, I am interested in scaling grid searches on an HPC LSF cluster with about 60 nodes, each with 20 cores. I thought i could just set n_jobs=1000 then submit a job with bsub -n 1000, but then I dug deeper and understood that the underlying joblib used by scikit-learn will create all of those jobs on a single node, resulting in no performance benefits. So I am stuck using a single node. I've read a lengthy discussion some time ago about adding something like this in scikit-learn: https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/4F26C3CB.8070603 at ais.uni-bonn.de/ However, it hasn't materialized in any way, as far as I can tell. Do you know of any way to do this, or any modern cluster computing libraries for python that might help me write something myself (I found a lot, but it's hard to tell what's considered good or even still under development)? Also, are there still plans to implement this in scikit-learn? You seemed to like the idea back then. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Sun Aug 7 05:05:41 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Sun, 07 Aug 2016 09:05:41 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: Message-ID: This might be interesting to you: http://blaze.pydata.org/blog/2015/10/19/dask-learn/ On Sun, 7 Aug 2016 at 10:42 Vlad Ionescu wrote: > Hello, > > I am interested in scaling grid searches on an HPC LSF cluster with about > 60 nodes, each with 20 cores. I thought i could just set n_jobs=1000 then > submit a job with bsub -n 1000, but then I dug deeper and understood that > the underlying joblib used by scikit-learn will create all of those jobs on > a single node, resulting in no performance benefits. So I am stuck using a > single node. > > I've read a lengthy discussion some time ago about adding something like > this in scikit-learn: > https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/4F26C3CB.8070603 at ais.uni-bonn.de/ > > > However, it hasn't materialized in any way, as far as I can tell. 
> > Do you know of any way to do this, or any modern cluster computing > libraries for python that might help me write something myself (I found a > lot, but it's hard to tell what's considered good or even still under > development)? > > Also, are there still plans to implement this in scikit-learn? You seemed > to like the idea back then. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ionescu.vlad1 at gmail.com Sun Aug 7 06:51:32 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Sun, 07 Aug 2016 10:51:32 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: Message-ID: Thanks, that looks interesting. I've looked into dask-learn's grid search ( https://github.com/mrocklin/dask-learn/blob/master/grid_search.py) but it seems not to make use of the n_jobs parameter. Will this work in a distributed fashion? The link you gave seemed to focus more on optimizing the grid search by eliminating duplicate work rather than by distributing it on more machines (I am actually using a random search, so I'm not sure those optimizations apply to my use case anyway). Dask itself seems like it might work, although it seems to require running manually on each node. Will look into it some more. On Sun, Aug 7, 2016 at 12:06 PM federico vaggi wrote: > This might be interesting to you: > > http://blaze.pydata.org/blog/2015/10/19/dask-learn/ > > > On Sun, 7 Aug 2016 at 10:42 Vlad Ionescu wrote: > >> Hello, >> >> I am interested in scaling grid searches on an HPC LSF cluster with about >> 60 nodes, each with 20 cores. I thought i could just set n_jobs=1000 then >> submit a job with bsub -n 1000, but then I dug deeper and understood that >> the underlying joblib used by scikit-learn will create all of those jobs on >> a single node, resulting in no performance benefits. So I am stuck using a >> single node. >> >> I've read a lengthy discussion some time ago about adding something like >> this in scikit-learn: >> https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/4F26C3CB.8070603 at ais.uni-bonn.de/ >> >> >> However, it hasn't materialized in any way, as far as I can tell. >> >> Do you know of any way to do this, or any modern cluster computing >> libraries for python that might help me write something myself (I found a >> lot, but it's hard to tell what's considered good or even still under >> development)? >> >> Also, are there still plans to implement this in scikit-learn? You seemed >> to like the idea back then. >> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Sun Aug 7 08:39:55 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sun, 7 Aug 2016 14:39:55 +0200 Subject: [scikit-learn] Disable Travis Cache Message-ID: Could someone disable the Travis cache once and for all please? I have seen several frustrating incidents where the Travis fails the PR because of this caching of old files. I also don't understand why it is enabled in the first place. 
It would really be super helpful if it is disabled for good. Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 **cc**: Olivier, Andy -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at telecom-paristech.fr Sun Aug 7 09:01:15 2016 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Sun, 7 Aug 2016 15:01:15 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: hi, I just flushed all the caches. HTH Alex On Sun, Aug 7, 2016 at 2:39 PM, Raghav R V wrote: > Could someone disable the Travis cache once and for all please? > > I have seen several frustrating incidents where the Travis fails the PR > because of this caching of old files. > > I also don't understand why it is enabled in the first place. It would > really be super helpful if it is disabled for good. > > Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 > > **cc**: Olivier, Andy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From gael.varoquaux at normalesup.org Sun Aug 7 13:30:43 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 7 Aug 2016 19:30:43 +0200 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: Message-ID: <20160807173043.GI3335822@phare.normalesup.org> Parallel computing in scikit-learn is built upon on joblib. In the development version of scikit-learn, the included joblib can be extended with a distributed backend: http://distributed.readthedocs.io/en/latest/joblib.html that can distribute code on a cluster. This is still bleeding edge, but this is probably a direction that will see more development. From ionescu.vlad1 at gmail.com Sun Aug 7 17:25:47 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Sun, 07 Aug 2016 21:25:47 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: <20160807173043.GI3335822@phare.normalesup.org> References: <20160807173043.GI3335822@phare.normalesup.org> Message-ID: I copy pasted the example in the link you gave, only made the search take a longer time. I used dask-ssh to setup worker nodes and a scheduler, then connected to the scheduler in my code. Tweaking the n_jobs parameters for the randomized search does not get any performance benefits. The connection to the scheduler seems to work, but nothing gets assigned to the workers, because the code doesn't scale. I am using scikit-learn 0.18.dev0 Any ideas? Code and results are below. Only the n_jobs value was changed between executions. I printed an Executor assigned to my scheduler, and it reported 240 cores. 
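A sketch of that kind of check (assuming the same scheduler address as in the code below, and that the installed distributed version exposes Executor.ncores(), has_what() and processing()):

from distributed import Executor

e = Executor('my_scheduler:8786')      # connect to the scheduler started by dask-ssh
print(e)                               # repr lists the workers and cores the scheduler sees
print(sum(e.ncores().values()))        # total cores across all workers, e.g. 240
# During or after a fit, these show whether tasks and results ever reach the remote workers:
print(e.processing())                  # what each worker is currently running
print(e.has_what())                    # which results each worker currently holds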
import distributed.joblib from joblib import Parallel, parallel_backend from sklearn.datasets import load_digits from sklearn.grid_search import RandomizedSearchCV from sklearn.svm import SVC import numpy as np digits = load_digits() param_space = { 'C': np.logspace(-6, 6, 100), 'gamma': np.logspace(-8, 8, 100), 'tol': np.logspace(-4, -1, 100), 'class_weight': [None, 'balanced'], } model = SVC(kernel='rbf') search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, verbose=1, *n_jobs=200*) with parallel_backend('distributed', scheduler_host='my_scheduler:8786'): search.fit(digits.data, digits.target) Fitting 3 folds for each of 1000 candidates, totalling 3000 fits [Parallel(n_jobs=200)]: Done 4 tasks | elapsed: 0.5s [Parallel(n_jobs=200)]: Done 292 tasks | elapsed: 6.9s [Parallel(n_jobs=200)]: Done 800 tasks | elapsed: 16.1s [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s [Parallel(*n_jobs=200*)]: Done 3000 out of 3000 | *elapsed: 1.0min finished* ------------------------------------- Fitting 3 folds for each of 1000 candidates, totalling 3000 fits [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.5s [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 3.7s [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 8.6s [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 16.2s [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s [Parallel(*n_jobs=20*)]: Done 3000 out of 3000 | *elapsed: 1.0min finished* On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux wrote: > Parallel computing in scikit-learn is built upon on joblib. In the > development version of scikit-learn, the included joblib can be extended > with a distributed backend: > http://distributed.readthedocs.io/en/latest/joblib.html > that can distribute code on a cluster. > > This is still bleeding edge, but this is probably a direction that will > see more development. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Aug 7 17:39:43 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 7 Aug 2016 17:39:43 -0400 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: Why do you think it should be disabled instead of fixed? On 08/07/2016 08:39 AM, Raghav R V wrote: > Could someone disable the Travis cache once and for all please? > > I have seen several frustrating incidents where the Travis fails the > PR because of this caching of old files. > > I also don't understand why it is enabled in the first place. It would > really be super helpful if it is disabled for good. > > Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 > > **cc**: Olivier, Andy > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Mon Aug 8 01:24:20 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 8 Aug 2016 07:24:20 +0200 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: <20160807173043.GI3335822@phare.normalesup.org> Message-ID: <20160808052420.GR3335822@phare.normalesup.org> My guess is that your model evaluations are too fast, and that you are not getting the benefits of distributed computing as the overhead is hiding them. Anyhow, I don't think that this is ready for prime-time usage. It probably requires tweeking and understanding the tradeoffs. G On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote: > I copy pasted the example in the link you gave, only made the search take a > longer time. I used dask-ssh to setup worker nodes and a scheduler, then > connected to the scheduler in my code. > Tweaking the n_jobs parameters for the randomized search does not get any > performance benefits. The connection to the scheduler seems to work, but > nothing gets assigned to the workers, because the code doesn't scale. > I am using scikit-learn 0.18.dev0 > Any ideas? > Code and results are below. Only the n_jobs value was changed between > executions. I printed an Executor assigned to my scheduler, and it reported 240 > cores. > import distributed.joblib > from joblib import Parallel, parallel_backend > from sklearn.datasets import load_digits > from sklearn.grid_search import RandomizedSearchCV > from sklearn.svm import SVC > import numpy as np > digits = load_digits() > param_space = { > ? ? 'C': np.logspace(-6, 6, 100), > ? ? 'gamma': np.logspace(-8, 8, 100), > ? ? 'tol': np.logspace(-4, -1, 100), > ? ? 'class_weight': [None, 'balanced'], > } > model = SVC(kernel='rbf') > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, verbose=1, > n_jobs=200) > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'): > ? ? search.fit(digits.data, digits.target) > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > [Parallel(n_jobs=200)]: Done ? 4 tasks ? ? ?| elapsed: ? ?0.5s > [Parallel(n_jobs=200)]: Done 292 tasks ? ? ?| elapsed: ? ?6.9s > [Parallel(n_jobs=200)]: Done 800 tasks ? ? ?| elapsed: ? 16.1s > [Parallel(n_jobs=200)]: Done 1250 tasks ? ? ?| elapsed: ? 24.8s > [Parallel(n_jobs=200)]: Done 1800 tasks ? ? ?| elapsed: ? 36.0s > [Parallel(n_jobs=200)]: Done 2450 tasks ? ? ?| elapsed: ? 49.0s > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: ?1.0min finished > ------------------------------------- > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > [Parallel(n_jobs=20)]: Done ?10 tasks ? ? ?| elapsed: ? ?0.5s > [Parallel(n_jobs=20)]: Done 160 tasks ? ? ?| elapsed: ? ?3.7s > [Parallel(n_jobs=20)]: Done 410 tasks ? ? ?| elapsed: ? ?8.6s > [Parallel(n_jobs=20)]: Done 760 tasks ? ? ?| elapsed: ? 16.2s > [Parallel(n_jobs=20)]: Done 1210 tasks ? ? ?| elapsed: ? 25.0s > [Parallel(n_jobs=20)]: Done 1760 tasks ? ? ?| elapsed: ? 36.2s > [Parallel(n_jobs=20)]: Done 2410 tasks ? ? ?| elapsed: ? 48.8s > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: ?1.0min finished > ? > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux > wrote: > Parallel computing in scikit-learn is built upon on joblib. In the > development version of scikit-learn, the included joblib can be extended > with a distributed backend: > http://distributed.readthedocs.io/en/latest/joblib.html > that can distribute code on a cluster. 
> This is still bleeding edge, but this is probably a direction that will > see more development. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From ionescu.vlad1 at gmail.com Mon Aug 8 02:48:34 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Mon, 08 Aug 2016 06:48:34 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: <20160808052420.GR3335822@phare.normalesup.org> References: <20160807173043.GI3335822@phare.normalesup.org> <20160808052420.GR3335822@phare.normalesup.org> Message-ID: I don't think they're too fast. I tried with slower models and bigger data sets as well. I get the best results with n_jobs=20, which is the number of cores on a single node. Anything below is considerably slower, anything above is mostly the same, sometimes a little slower. Is there a way to see what each worker is running? Nothing is reported in the scheduler console window about the workers, just that there is a connection to the scheduler. Should something be reported about the work assigned to workers? If I notice speed benefits going from 1 to 20 n_jobs, surely there should be something noticeable above that as well if the distributed part is running correctly, no? This is a very easily parallelizable task, and my nodes are in a cluster on the same network. I highly doubt it's (just) overhead. Is there anything else that I could look into to try fixing this? Fitting 10 folds for each of 10000 candidates, totalling 100000 fits [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.7s [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 4.8s [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 12.6s [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 23.7s [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 37.9s [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 55.0s *[Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 1.2min* --- Fitting 10 folds for each of 10000 candidates, totalling 100000 fits [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.2s [Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 27.5s [Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 1.0min *[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 1.7min* --- Fitting 10 folds for each of 10000 candidates, totalling 100000 fits [Parallel(n_jobs=100)]: Done 250 tasks | elapsed: 9.1s [Parallel(n_jobs=100)]: Done 600 tasks | elapsed: 19.3s [Parallel(n_jobs=100)]: Done 1050 tasks | elapsed: 34.0s [Parallel(n_jobs=100)]: Done 1600 tasks | elapsed: 49.8s *[Parallel(n_jobs=100)]: Done 2250 tasks | elapsed: 1.2min* If 4 workers do 442 tasks in a minute, then 5x=20 workers should ideally do 5x442 = 2210. So double the workers, half the time seems to hold very well until 20 workers. I have a hard time imagining that it would stop holding at exactly the number of cores per node. On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux wrote: > My guess is that your model evaluations are too fast, and that you are > not getting the benefits of distributed computing as the overhead is > hiding them. > > Anyhow, I don't think that this is ready for prime-time usage. 
It > probably requires tweeking and understanding the tradeoffs. > > G > > On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote: > > I copy pasted the example in the link you gave, only made the search > take a > > longer time. I used dask-ssh to setup worker nodes and a scheduler, then > > connected to the scheduler in my code. > > > Tweaking the n_jobs parameters for the randomized search does not get any > > performance benefits. The connection to the scheduler seems to work, but > > nothing gets assigned to the workers, because the code doesn't scale. > > > I am using scikit-learn 0.18.dev0 > > > Any ideas? > > > Code and results are below. Only the n_jobs value was changed between > > executions. I printed an Executor assigned to my scheduler, and it > reported 240 > > cores. > > > import distributed.joblib > > from joblib import Parallel, parallel_backend > > from sklearn.datasets import load_digits > > from sklearn.grid_search import RandomizedSearchCV > > from sklearn.svm import SVC > > import numpy as np > > > digits = load_digits() > > > param_space = { > > 'C': np.logspace(-6, 6, 100), > > 'gamma': np.logspace(-8, 8, 100), > > 'tol': np.logspace(-4, -1, 100), > > 'class_weight': [None, 'balanced'], > > } > > > model = SVC(kernel='rbf') > > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, > verbose=1, > > n_jobs=200) > > > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'): > > search.fit(digits.data, digits.target) > > > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > > [Parallel(n_jobs=200)]: Done 4 tasks | elapsed: 0.5s > > [Parallel(n_jobs=200)]: Done 292 tasks | elapsed: 6.9s > > [Parallel(n_jobs=200)]: Done 800 tasks | elapsed: 16.1s > > [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s > > [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s > > [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s > > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: 1.0min finished > > > ------------------------------------- > > > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > > [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.5s > > [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 3.7s > > [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 8.6s > > [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 16.2s > > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s > > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s > > [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s > > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: 1.0min finished > > > > > > > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux < > gael.varoquaux at normalesup.org> > > wrote: > > > Parallel computing in scikit-learn is built upon on joblib. In the > > development version of scikit-learn, the included joblib can be > extended > > with a distributed backend: > > http://distributed.readthedocs.io/en/latest/joblib.html > > that can distribute code on a cluster. > > > This is still bleeding edge, but this is probably a direction that > will > > see more development. 
> > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ionescu.vlad1 at gmail.com Mon Aug 8 03:59:23 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Mon, 08 Aug 2016 07:59:23 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: <20160807173043.GI3335822@phare.normalesup.org> <20160808052420.GR3335822@phare.normalesup.org> Message-ID: I realize this is in early stages and I'd like to help improve it, even if just by testing on an actual cluster. All of the examples I've seen are very small, and it's impossible for anyone to notice if they're really running in parallel judging just by the execution time. None of them mention how you can ensure or check that each worker is doing work either. If there's anything I can do to help debug this (I realize it could be a problem on my end though), please let me know. On Mon, Aug 8, 2016 at 9:48 AM Vlad Ionescu wrote: > I don't think they're too fast. I tried with slower models and bigger data > sets as well. I get the best results with n_jobs=20, which is the number of > cores on a single node. Anything below is considerably slower, anything > above is mostly the same, sometimes a little slower. > > Is there a way to see what each worker is running? Nothing is reported in > the scheduler console window about the workers, just that there is a > connection to the scheduler. Should something be reported about the work > assigned to workers? > > If I notice speed benefits going from 1 to 20 n_jobs, surely there should > be something noticeable above that as well if the distributed part is > running correctly, no? This is a very easily parallelizable task, and my > nodes are in a cluster on the same network. I highly doubt it's (just) > overhead. > > Is there anything else that I could look into to try fixing this? 
> > Fitting 10 folds for each of 10000 candidates, totalling 100000 fits > [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.7s > [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 4.8s > [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 12.6s > [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 23.7s > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 37.9s > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 55.0s > *[Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 1.2min* > > --- > > Fitting 10 folds for each of 10000 candidates, totalling 100000 fits > [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.2s > [Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 27.5s > [Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 1.0min > *[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 1.7min* > > > --- > > Fitting 10 folds for each of 10000 candidates, totalling 100000 fits > [Parallel(n_jobs=100)]: Done 250 tasks | elapsed: 9.1s > [Parallel(n_jobs=100)]: Done 600 tasks | elapsed: 19.3s > [Parallel(n_jobs=100)]: Done 1050 tasks | elapsed: 34.0s > [Parallel(n_jobs=100)]: Done 1600 tasks | elapsed: 49.8s > *[Parallel(n_jobs=100)]: Done 2250 tasks | elapsed: 1.2min* > > If 4 workers do 442 tasks in a minute, then 5x=20 workers should ideally > do 5x442 = 2210. So double the workers, half the time seems to hold very > well until 20 workers. I have a hard time imagining that it would stop > holding at exactly the number of cores per node. > > On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> My guess is that your model evaluations are too fast, and that you are >> not getting the benefits of distributed computing as the overhead is >> hiding them. >> >> Anyhow, I don't think that this is ready for prime-time usage. It >> probably requires tweeking and understanding the tradeoffs. >> >> G >> >> On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote: >> > I copy pasted the example in the link you gave, only made the search >> take a >> > longer time. I used dask-ssh to setup worker nodes and a scheduler, then >> > connected to the scheduler in my code. >> >> > Tweaking the n_jobs parameters for the randomized search does not get >> any >> > performance benefits. The connection to the scheduler seems to work, but >> > nothing gets assigned to the workers, because the code doesn't scale. >> >> > I am using scikit-learn 0.18.dev0 >> >> > Any ideas? >> >> > Code and results are below. Only the n_jobs value was changed between >> > executions. I printed an Executor assigned to my scheduler, and it >> reported 240 >> > cores. 
>> >> > import distributed.joblib >> > from joblib import Parallel, parallel_backend >> > from sklearn.datasets import load_digits >> > from sklearn.grid_search import RandomizedSearchCV >> > from sklearn.svm import SVC >> > import numpy as np >> >> > digits = load_digits() >> >> > param_space = { >> > 'C': np.logspace(-6, 6, 100), >> > 'gamma': np.logspace(-8, 8, 100), >> > 'tol': np.logspace(-4, -1, 100), >> > 'class_weight': [None, 'balanced'], >> > } >> >> > model = SVC(kernel='rbf') >> > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, >> verbose=1, >> > n_jobs=200) >> >> > with parallel_backend('distributed', >> scheduler_host='my_scheduler:8786'): >> > search.fit(digits.data, digits.target) >> >> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits >> > [Parallel(n_jobs=200)]: Done 4 tasks | elapsed: 0.5s >> > [Parallel(n_jobs=200)]: Done 292 tasks | elapsed: 6.9s >> > [Parallel(n_jobs=200)]: Done 800 tasks | elapsed: 16.1s >> > [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s >> > [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s >> > [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s >> > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: 1.0min >> finished >> >> > ------------------------------------- >> >> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits >> > [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.5s >> > [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 3.7s >> > [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 8.6s >> > [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 16.2s >> > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s >> > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s >> > [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s >> > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: 1.0min finished >> >> >> > >> >> > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux < >> gael.varoquaux at normalesup.org> >> > wrote: >> >> > Parallel computing in scikit-learn is built upon on joblib. In the >> > development version of scikit-learn, the included joblib can be >> extended >> > with a distributed backend: >> > http://distributed.readthedocs.io/en/latest/joblib.html >> > that can distribute code on a cluster. >> >> > This is still bleeding edge, but this is probably a direction that >> will >> > see more development. >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Gael Varoquaux >> Researcher, INRIA Parietal >> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> Phone: ++ 33-1-69-08-79-68 >> http://gael-varoquaux.info >> http://twitter.com/GaelVaroquaux >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ragvrv at gmail.com Mon Aug 8 09:10:26 2016 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 8 Aug 2016 15:10:26 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: I felt we could rather have a clean build and wait for a few more minutes (if that's the disadvantage of disabling caching) than have it pass / fail on old code... On Sun, Aug 7, 2016 at 11:39 PM, Andreas Mueller wrote: > Why do you think it should be disabled instead of fixed? > > > > On 08/07/2016 08:39 AM, Raghav R V wrote: > > Could someone disable the Travis cache once and for all please? > > I have seen several frustrating incidents where the Travis fails the PR > because of this caching of old files. > > I also don't understand why it is enabled in the first place. It would > really be super helpful if it is disabled for good. > > Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 > > **cc**: Olivier, Andy > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Mon Aug 8 12:21:57 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Mon, 8 Aug 2016 09:21:57 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> Message-ID: Thanks for the pointers and papers. I'd definitely go through this approach and see if it can be applied to my problem. Thanks, Amita On Fri, Aug 5, 2016 at 4:40 PM, Albert Thomas wrote: > Hi, > > About your question on how to learn the parameters of anomaly detection > algorithms using only the negative samples in your case, Nicolas and I > worked on this aspect recently. If you are interested you can have look at: > > - Learning hyperparameters for unsupervised anomaly detection: > https://drive.google.com/file/d/0B8Dg3PBX90KNUTg5NGNOVnFPX0hDN > mJsSTcybzZMSHNPYkd3/view > - How to evaluate the quality of unsupervised anomaly Detection > algorithms?: > https://drive.google.com/file/d/0B8Dg3PBX90KNenV3WjRkR09Bakx5Y > lNyMF9BUXVNem1hb0NR/view > > Best, > Albert > > On Fri, Aug 5, 2016 at 9:34 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > >> > But this might be the kind of problem where you seriously ask how hard >> it would be to gather more data. >> >> >> Yeah, I agree, but this scenario is then typical in a sense of that it is >> an anomaly detection problem rather than a classification problem. I.e., >> you don?t have enough positive labels to fit the model and thus you need to >> do unsupervised learning to learn from the negative class only. >> >> Sure, supervised learning could work well, but I would also explore >> unsupervised learning here and see how that works for you; maybe one-class >> SVM as suggested or EM algorithm based mixture models ( >> http://scikit-learn.org/stable/modules/mixture.html) >> >> Best, >> Sebastian >> >> > On Aug 5, 2016, at 2:55 PM, Jared Gabor wrote: >> > >> > Lots of great suggestions on how to model your problem. But this might >> be the kind of problem where you seriously ask how hard it would be to >> gather more data. 
>> > >> > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: >> > Hi, >> > >> > I am currently exploring the problem of speed bump detection using >> accelerometer time series data. >> > I have extracted some features based on mean, std deviation etc within >> a time window. >> > >> > Since the dataset is highly skewed ( I have just 5 positive samples >> for every > 300 samples) >> > I was looking into >> > >> > One ClassSVM >> > covariance.EllipticEnvelope >> > sklearn.ensemble.IsolationForest >> > but I am not sure how to use them. >> > >> > What I get from docs >> > >> > separate the positive examples and train using only negative examples >> > clf.fit(X_train) >> > and then >> > predict the positive examples using >> > clf.predict(X_test) >> > >> > >> > I am not sure what is then the role of positive examples in my training >> dataset or how can I use them to improve my classifier so that I can >> predict better on new samples. >> > >> > >> > Can we do something like Cross validation to learn the parameters as in >> normal binary SVM classification >> > >> > Thanks,? >> > Amita >> > >> > Amita Misra >> > Graduate Student Researcher >> > Natural Language and Dialogue Systems Lab >> > Baskin School of Engineering >> > University of California Santa Cruz >> > >> > >> > >> > >> > >> > -- >> > Amita Misra >> > Graduate Student Researcher >> > Natural Language and Dialogue Systems Lab >> > Baskin School of Engineering >> > University of California Santa Cruz >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Aug 8 12:22:50 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 8 Aug 2016 18:22:50 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: <20160808162250.GC3335822@phare.normalesup.org> On Mon, Aug 08, 2016 at 03:10:26PM +0200, Raghav R V wrote: > I felt we could rather have a clean build and wait for a few more minutes (if > that's the disadvantage of disabling caching) than have it pass / fail on old > code... Time of CI is a real problem. On our side because it slows down merges of PRs, and for the infrastructure: it's server time and money for them. I'd much rather have a working cache. Ga?l From ragvrv at gmail.com Mon Aug 8 12:56:29 2016 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 8 Aug 2016 18:56:29 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: <20160808162250.GC3335822@phare.normalesup.org> References: <20160808162250.GC3335822@phare.normalesup.org> Message-ID: Ok. Thanks for the comments! 
On Mon, Aug 8, 2016 at 6:22 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Mon, Aug 08, 2016 at 03:10:26PM +0200, Raghav R V wrote: > > I felt we could rather have a clean build and wait for a few more > minutes (if > > that's the disadvantage of disabling caching) than have it pass / fail > on old > > code... > > Time of CI is a real problem. On our side because it slows down merges of > PRs, and for the infrastructure: it's server time and money for them. > > I'd much rather have a working cache. > > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zude07 at yahoo.com Thu Aug 11 07:21:22 2016 From: zude07 at yahoo.com (Ali Zude) Date: Thu, 11 Aug 2016 11:21:22 +0000 (UTC) Subject: [scikit-learn] Speeding up RF regressors References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> Hi all, I've 6 RF models and I am using them online to predict 6 different variables (using the same features), models quality (error in test data is good). However, the online prediction is very very slow. How can I speed up the prediction? - ??? Can I import models into C++ code? - ??? Is it useful to upgrade to scikit-learn 0.18? and then use multi-output models? - ??? Is sklearn-compiledtreesuseful, they are claiming that it will speed the prediction (5x-8x)times? - I could not use because of array2d error >>PyPi Thank you for your help RegardsAli -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Thu Aug 11 07:26:46 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Thu, 11 Aug 2016 13:26:46 +0200 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> Message-ID: Hi Ali, I'm using sklearn-compiledtrees [ https://github.com/ajtulloch/sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled ~100MB) and the speedup is gigantic (never measured it properly) but I'd say it's over 10x. ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn < scikit-learn at python.org>: > Hi all, > > I've 6 RF models and I am using them online to predict 6 different > variables (using the same features), models quality (error in test data is > good). However, the online prediction is very very slow. > How can I speed up the prediction? > > - Can I import models into C++ code? > - Is it useful to upgrade to scikit-learn 0.18? and then use > multi-output models? > - Is sklearn-compiledtreesuseful, they are claiming that it will > speed the prediction (5x-8x)times? > - I could not use because of array2d error >>PyPi > > Thank you for your help > > Regards > Ali > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From zude07 at yahoo.com Thu Aug 11 07:31:24 2016 From: zude07 at yahoo.com (Ali Zude) Date: Thu, 11 Aug 2016 11:31:24 +0000 (UTC) Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> Message-ID: <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> Thnx Maciek, I've tried to use it but I could not sort out the PyPi problem,? see the error below. Thanks in advance. ---> 16 import compiledtrees /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in () ----> 1 from compiledtrees.compiled import CompiledRegressionPredictor 2 3 __all__ = ["CompiledRegressionPredictor"] /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () 1 from __future__ import print_function 2 ----> 3 from sklearn.utils import array2d 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor ImportError: cannot import name array2d Kind regards Ali Von: Maciek W?jcikowski An: Ali Zude ; Scikit-learn user and developer mailing list Gesendet: 12:26 Donnerstag, 11.August 2016 Betreff: Re: [scikit-learn] Speeding up RF regressors Hi Ali, I'm using sklearn-compiledtrees [https://github.com/ajtulloch/sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled ~100MB) and the speedup is gigantic (never measured it properly) but I'd say it's over 10x. ---- Pozdrawiam, ?| ?Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn : Hi all, I've 6 RF models and I am using them online to predict 6 different variables (using the same features), models quality (error in test data is good). However, the online prediction is very very slow. How can I speed up the prediction? - ??? Can I import models into C++ code? - ??? Is it useful to upgrade to scikit-learn 0.18? and then use multi-output models? - ??? Is sklearn-compiledtreesuseful, they are claiming that it will speed the prediction (5x-8x)times? - I could not use because of array2d error >>PyPi Thank you for your help RegardsAli ______________________________ _________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/ mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Thu Aug 11 09:10:48 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Thu, 11 Aug 2016 15:10:48 +0200 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> Message-ID: First of all the pypi version is outdated, please install using > > pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git Secondly, which scikit-learn version are you using? ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 13:31 GMT+02:00 Ali Zude : > Thnx Maciek, > > I've tried to use it but I could not sort out the PyPi problem, see the > error below. Thanks in advance. 
> > ---> 16 import compiledtrees > /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in ()----> 1 from compiledtrees.compiled import CompiledRegressionPredictor 2 3 __all__ = ["CompiledRegressionPredictor"] > /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () 1 from __future__ import print_function 2 ----> 3 from sklearn.utils import array2d 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor > ImportError: cannot import name array2d > > > Kind regards > Ali > > ------------------------------ > *Von:* Maciek W?jcikowski > *An:* Ali Zude ; Scikit-learn user and developer > mailing list > *Gesendet:* 12:26 Donnerstag, 11.August 2016 > *Betreff:* Re: [scikit-learn] Speeding up RF regressors > > Hi Ali, > > I'm using sklearn-compiledtrees [https://github.com/ajtulloch/ > sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled > ~100MB) and the speedup is gigantic (never measured it properly) but I'd > say it's over 10x. > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn < > scikit-learn at python.org>: > > Hi all, > > I've 6 RF models and I am using them online to predict 6 different > variables (using the same features), models quality (error in test data is > good). However, the online prediction is very very slow. > How can I speed up the prediction? > > - Can I import models into C++ code? > - Is it useful to upgrade to scikit-learn 0.18? and then use > multi-output models? > - Is sklearn-compiledtreesuseful, they are claiming that it will > speed the prediction (5x-8x)times? > - I could not use because of array2d error >>PyPi > > Thank you for your help > > Regards > Ali > > ______________________________ _________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/ mailman/listinfo/scikit-learn > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From odaym2 at gmail.com Thu Aug 11 09:41:52 2016 From: odaym2 at gmail.com (o m) Date: Thu, 11 Aug 2016 09:41:52 -0400 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> Message-ID: <4CC9E9A2-2EC1-4C31-B2ED-F9B86854CC41@gmail.com> Can someone please take me off this list? Thanks Sent from my iPhone > On Aug 11, 2016, at 9:10 AM, Maciek W?jcikowski wrote: > > First of all the pypi version is outdated, please install using >> >> pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git > > Secondly, which scikit-learn version are you using? > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2016-08-11 13:31 GMT+02:00 Ali Zude : >> Thnx Maciek, >> >> I've tried to use it but I could not sort out the PyPi problem, see the error below. Thanks in advance. 
>> >> ---> 16 import compiledtrees >> >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in () >> ----> 1 from compiledtrees.compiled import CompiledRegressionPredictor >> 2 >> 3 __all__ = ["CompiledRegressionPredictor"] >> >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () >> 1 from __future__ import print_function >> 2 >> ----> 3 from sklearn.utils import array2d >> 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE >> 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor >> >> ImportError: cannot import name array2d >> >> >> Kind regards >> Ali >> >> Von: Maciek W?jcikowski >> An: Ali Zude ; Scikit-learn user and developer mailing list >> Gesendet: 12:26 Donnerstag, 11.August 2016 >> Betreff: Re: [scikit-learn] Speeding up RF regressors >> >> Hi Ali, >> >> I'm using sklearn-compiledtrees [https://github.com/ajtulloch/sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled ~100MB) and the speedup is gigantic (never measured it properly) but I'd say it's over 10x. >> >> ---- >> Pozdrawiam, | Best regards, >> Maciek W?jcikowski >> maciek at wojcikowski.pl >> >> 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn : >> Hi all, >> >> I've 6 RF models and I am using them online to predict 6 different variables (using the same features), models quality (error in test data is good). However, the online prediction is very very slow. >> How can I speed up the prediction? >> Can I import models into C++ code? >> Is it useful to upgrade to scikit-learn 0.18? and then use multi-output models? >> Is sklearn-compiledtreesuseful, they are claiming that it will speed the prediction (5x-8x)times? >> I could not use because of array2d error >>PyPi >> Thank you for your help >> >> Regards >> Ali >> >> ______________________________ _________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/ mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad32.de at gmail.com Thu Aug 11 09:45:43 2016 From: vlad32.de at gmail.com (Vlad Deshkovich) Date: Thu, 11 Aug 2016 09:45:43 -0400 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: <4CC9E9A2-2EC1-4C31-B2ED-F9B86854CC41@gmail.com> References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> <4CC9E9A2-2EC1-4C31-B2ED-F9B86854CC41@gmail.com> Message-ID: Please remove me as well. On Thursday, August 11, 2016, o m wrote: > Can someone please take me off this list? Thanks > > Sent from my iPhone > > On Aug 11, 2016, at 9:10 AM, Maciek W?jcikowski > wrote: > > First of all the pypi version is outdated, please install using >> >> pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git > > > Secondly, which scikit-learn version are you using? > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > > 2016-08-11 13:31 GMT+02:00 Ali Zude >: > >> Thnx Maciek, >> >> I've tried to use it but I could not sort out the PyPi problem, see the >> error below. Thanks in advance. 
>> >> ---> 16 import compiledtrees >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in ()----> 1 from compiledtrees.compiled import CompiledRegressionPredictor 2 3 __all__ = ["CompiledRegressionPredictor"] >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () 1 from __future__ import print_function 2 ----> 3 from sklearn.utils import array2d 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor >> ImportError: cannot import name array2d >> >> >> Kind regards >> Ali >> >> ------------------------------ >> *Von:* Maciek W?jcikowski > > >> *An:* Ali Zude > >; Scikit-learn user >> and developer mailing list > > >> *Gesendet:* 12:26 Donnerstag, 11.August 2016 >> *Betreff:* Re: [scikit-learn] Speeding up RF regressors >> >> Hi Ali, >> >> I'm using sklearn-compiledtrees [https://github.com/ajtulloch/ >> sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled >> ~100MB) and the speedup is gigantic (never measured it properly) but I'd >> say it's over 10x. >> >> ---- >> Pozdrawiam, | Best regards, >> Maciek W?jcikowski >> maciek at wojcikowski.pl >> >> >> 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn < >> scikit-learn at python.org >> >: >> >> Hi all, >> >> I've 6 RF models and I am using them online to predict 6 different >> variables (using the same features), models quality (error in test data is >> good). However, the online prediction is very very slow. >> How can I speed up the prediction? >> >> - Can I import models into C++ code? >> - Is it useful to upgrade to scikit-learn 0.18? and then use >> multi-output models? >> - Is sklearn-compiledtreesuseful, they are claiming that it will >> speed the prediction (5x-8x)times? >> - I could not use because of array2d error >>PyPi >> >> Thank you for your help >> >> Regards >> Ali >> >> ______________________________ _________________ >> scikit-learn mailing list >> scikit-learn at python.org >> >> https://mail.python.org/ mailman/listinfo/scikit-learn >> >> >> >> >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zude07 at yahoo.com Thu Aug 11 17:39:30 2016 From: zude07 at yahoo.com (Ali Zude) Date: Thu, 11 Aug 2016 21:39:30 +0000 (UTC) Subject: [scikit-learn] Compiled trees References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> Dear All, I am trying to speed up the prediction of Random Forests. I've used compiledtress, which was useful, but since I have 6 models and once I've loaded all of them I got "Multiprocessing exception:" here is my models in the code: ...model1=joblib.load('/models/model1.pkl'') model2=joblib.load('/models/model2.pkl') model3=joblib.load('/models/model3.pkl') model4=compiledtrees.CompiledRegressionPredictor(joblib.load('/models/model4.pkl')) model5=compiledtrees.CompiledRegressionPredictor(joblib.load('/models/model4.pkl')) model6=compiledtrees.CompiledRegressionPredictor(joblib.load('/models/model4.pkl')) model1=compiledtrees.CompiledRegressionPredictor(model1) model2=compiledtrees.CompiledRegressionPredictor(model2) model3=compiledtrees.CompiledRegressionPredictor(model3).... 
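One way to narrow down which of the six compilations actually raises the "Multiprocessing exception", and to avoid holding every uncompiled forest in memory at once, might be a loop like the following (a sketch reusing the model paths above; the del/gc step only frees each original forest before the next one is loaded):

import gc
from sklearn.externals import joblib
from compiledtrees import CompiledRegressionPredictor

compiled = {}
for name in ['model1', 'model2', 'model3', 'model4', 'model5', 'model6']:
    rf = joblib.load('/models/%s.pkl' % name)
    try:
        compiled[name] = CompiledRegressionPredictor(rf)
    except Exception as exc:
        print(name, 'failed to compile:', exc)   # shows which model triggers the exception
    del rf
    gc.collect()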
Now I'm trying to use MultiOutputRegressor(RandomForestRegressor()), however, I could not find any tool to do model selection, can anyone help me either to solve the first problem or the second one Best regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Fri Aug 12 02:30:03 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Fri, 12 Aug 2016 08:30:03 +0200 Subject: [scikit-learn] Compiled trees In-Reply-To: <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> Message-ID: Which version of compiledtrees are you using? ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 23:39 GMT+02:00 Ali Zude via scikit-learn < scikit-learn at python.org>: > Dear All, > > I am trying to speed up the prediction of Random Forests. I've used > compiledtress, which was useful, but since I have 6 models and once I've > loaded all of them I got "Multiprocessing exception:" > > here is my models in the code: > ... > model1=joblib.load('/models/model1.pkl'') > model2=joblib.load('/models/model2.pkl') > model3=joblib.load('/models/model3.pkl') > model4=compiledtrees.CompiledRegressionPredictor( > joblib.load('/models/model4.pkl')) > model5=compiledtrees.CompiledRegressionPredictor( > joblib.load('/models/model4.pkl')) > model6=compiledtrees.CompiledRegressionPredictor( > joblib.load('/models/model4.pkl')) > > model1=compiledtrees.CompiledRegressionPredictor(model1) > model2=compiledtrees.CompiledRegressionPredictor(model2) > model3=compiledtrees.CompiledRegressionPredictor(model3) > .... > > Now I'm trying to use MultiOutputRegressor(RandomForestRegressor()), > however, I could not find any tool to do model selection, can anyone help > me either to solve the first problem or the second one > > Best regards > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zude07 at yahoo.com Fri Aug 12 02:37:37 2016 From: zude07 at yahoo.com (Ali Zude) Date: Fri, 12 Aug 2016 06:37:37 +0000 (UTC) Subject: [scikit-learn] Compiled trees In-Reply-To: References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> Message-ID: <616174978.13126540.1470983857571.JavaMail.yahoo@mail.yahoo.com> sklearn-compiledtrees==1.3 Von: Maciek W?jcikowski An: Ali Zude ; Scikit-learn user and developer mailing list Gesendet: 7:30 Freitag, 12.August 2016 Betreff: Re: [scikit-learn] Compiled trees Which version of compiledtrees are you using? ---- Pozdrawiam, ?| ?Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 23:39 GMT+02:00 Ali Zude via scikit-learn : Dear All, I am trying to speed up the prediction of Random Forests. I've used compiledtress, which was useful, but since I have 6 models and once I've loaded all of them I got "Multiprocessing exception:" here is my models in the code: ...model1=joblib.load('/models/ model1.pkl'') model2=joblib.load('/models/ model2.pkl') model3=joblib.load('/models/ model3.pkl') model4=compiledtrees. CompiledRegressionPredictor( joblib.load('/models/model4. pkl')) model5=compiledtrees. 
CompiledRegressionPredictor( joblib.load('/models/model4. pkl')) model6=compiledtrees. CompiledRegressionPredictor( joblib.load('/models/model4. pkl')) model1=compiledtrees. CompiledRegressionPredictor( model1) model2=compiledtrees. CompiledRegressionPredictor( model2) model3=compiledtrees. CompiledRegressionPredictor( model3).... Now I'm trying to use MultiOutputRegressor( RandomForestRegressor()), however, I could not find any tool to do model selection, can anyone help me either to solve the first problem or the second one Best regards ______________________________ _________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/ mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Fri Aug 12 03:48:19 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Fri, 12 Aug 2016 09:48:19 +0200 Subject: [scikit-learn] Compiled trees In-Reply-To: <616174978.13126540.1470983857571.JavaMail.yahoo@mail.yahoo.com> References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> <616174978.13126540.1470983857571.JavaMail.yahoo@mail.yahoo.com> Message-ID: Can you please copy whole error message? There seams to be a problem with compiling the tree. Do you have gcc or other C compiler under CXX shell variable? ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-12 8:37 GMT+02:00 Ali Zude : > sklearn-compiledtrees==1.3 > > > ------------------------------ > *Von:* Maciek W?jcikowski > *An:* Ali Zude ; Scikit-learn user and developer > mailing list > *Gesendet:* 7:30 Freitag, 12.August 2016 > *Betreff:* Re: [scikit-learn] Compiled trees > > Which version of compiledtrees are you using? > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2016-08-11 23:39 GMT+02:00 Ali Zude via scikit-learn < > scikit-learn at python.org>: > > Dear All, > > I am trying to speed up the prediction of Random Forests. I've used > compiledtress, which was useful, but since I have 6 models and once I've > loaded all of them I got "Multiprocessing exception:" > > here is my models in the code: > ... > model1=joblib.load('/models/ model1.pkl'') > model2=joblib.load('/models/ model2.pkl') > model3=joblib.load('/models/ model3.pkl') > model4=compiledtrees. CompiledRegressionPredictor( > joblib.load('/models/model4. pkl')) > model5=compiledtrees. CompiledRegressionPredictor( > joblib.load('/models/model4. pkl')) > model6=compiledtrees. CompiledRegressionPredictor( > joblib.load('/models/model4. pkl')) > > model1=compiledtrees. CompiledRegressionPredictor( model1) > model2=compiledtrees. CompiledRegressionPredictor( model2) > model3=compiledtrees. CompiledRegressionPredictor( model3) > .... > > Now I'm trying to use MultiOutputRegressor( RandomForestRegressor()), > however, I could not find any tool to do model selection, can anyone help > me either to solve the first problem or the second one > > Best regards > > ______________________________ _________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/ mailman/listinfo/scikit-learn > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris at upnix.com Mon Aug 15 17:27:03 2016 From: chris at upnix.com (Chris Cameron) Date: Mon, 15 Aug 2016 15:27:03 -0600 Subject: [scikit-learn] Inconsistent Logistic Regression fit results Message-ID: Hi all, Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. The code I?m using: def log_run(logreg_x, logreg_y): logreg_x['pass_fail'] = logreg_y df_train, df_test = train_test_split(logreg_x, random_state=0) y_train = df_train.pass_fail.as_matrix() y_test = df_test.pass_fail.as_matrix() del(df_train['pass_fail']) del(df_test['pass_fail']) log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) predicted = log_reg_fit.predict(df_test) accuracy = accuracy_score(y_test, predicted) kappa = cohen_kappa_score(y_test, predicted) return [kappa, accuracy] I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. Example output: --- log_run(df_save, y) Out[32]: [0.027777777777777728, 0.53333333333333333] log_run(df_save, y) Out[33]: [0.027777777777777728, 0.53333333333333333] log_run(df_save, y) Out[34]: [0.11347517730496456, 0.58333333333333337] log_run(df_save, y) Out[35]: [0.042553191489361743, 0.55000000000000004] log_run(df_save, y) Out[36]: [-0.07407407407407407, 0.51666666666666672] log_run(df_save, y) Out[37]: [0.042553191489361743, 0.55000000000000004] A little information on the problem DataFrame: --- len(df_save) Out[40]: 240 len(df_save.columns) Out[41]: 18 If I omit this particular column the Kappa no longer fluctuates: df_save[?abc'].head() Out[42]: 0 0.026316 1 0.333333 2 0.015152 3 0.010526 4 0.125000 Name: abc, dtype: float64 Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? Thanks! Chris From mail at sebastianraschka.com Mon Aug 15 17:42:10 2016 From: mail at sebastianraschka.com (mail at sebastianraschka.com) Date: Mon, 15 Aug 2016 17:42:10 -0400 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: References: Message-ID: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> Hi, Chris, have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. Best, Sebastian > On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: > > Hi all, > > Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. > > The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. 
> > The code I?m using: > > def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y > df_train, df_test = train_test_split(logreg_x, random_state=0) > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) > predicted = log_reg_fit.predict(df_test) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > > I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. > > Example output: > --- > log_run(df_save, y) > Out[32]: [0.027777777777777728, 0.53333333333333333] > > log_run(df_save, y) > Out[33]: [0.027777777777777728, 0.53333333333333333] > > log_run(df_save, y) > Out[34]: [0.11347517730496456, 0.58333333333333337] > > log_run(df_save, y) > Out[35]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[36]: [-0.07407407407407407, 0.51666666666666672] > > log_run(df_save, y) > Out[37]: [0.042553191489361743, 0.55000000000000004] > > A little information on the problem DataFrame: > --- > len(df_save) > Out[40]: 240 > > len(df_save.columns) > Out[41]: 18 > > > If I omit this particular column the Kappa no longer fluctuates: > > df_save[?abc'].head() > Out[42]: > 0 0.026316 > 1 0.333333 > 2 0.015152 > 3 0.010526 > 4 0.125000 > Name: abc, dtype: float64 > > > Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? > > > Thanks! > Chris > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From chris at upnix.com Mon Aug 15 18:00:05 2016 From: chris at upnix.com (Chris Cameron) Date: Mon, 15 Aug 2016 16:00:05 -0600 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> Message-ID: <903082E9-D944-4838-A882-982911520540@upnix.com> Sebastian, That doesn?t do it. With the function: def log_run(logreg_x, logreg_y): logreg_x['pass_fail'] = logreg_y df_train, df_test = train_test_split(logreg_x, random_state=0) y_train = df_train.pass_fail.as_matrix() y_test = df_test.pass_fail.as_matrix() del(df_train['pass_fail']) del(df_test['pass_fail']) log_reg_fit = LogisticRegression(class_weight='balanced', tol=0.000000001, random_state=0).fit(df_train, y_train) predicted = log_reg_fit.predict(df_test) accuracy = accuracy_score(y_test, predicted) kappa = cohen_kappa_score(y_test, predicted) return [kappa, accuracy] I?m still seeing: log_run(df_save, y) Out[7]: [-0.054421768707483005, 0.48333333333333334] log_run(df_save, y) Out[8]: [0.042553191489361743, 0.55000000000000004] log_run(df_save, y) Out[9]: [0.042553191489361743, 0.55000000000000004] log_run(df_save, y) Out[10]: [0.027777777777777728, 0.53333333333333333] Chris > On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: > > Hi, Chris, > have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. 
> > Best, > Sebastian > > > >> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >> >> Hi all, >> >> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >> >> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. >> >> The code I?m using: >> >> def log_run(logreg_x, logreg_y): >> logreg_x['pass_fail'] = logreg_y >> df_train, df_test = train_test_split(logreg_x, random_state=0) >> y_train = df_train.pass_fail.as_matrix() >> y_test = df_test.pass_fail.as_matrix() >> del(df_train['pass_fail']) >> del(df_test['pass_fail']) >> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >> predicted = log_reg_fit.predict(df_test) >> accuracy = accuracy_score(y_test, predicted) >> kappa = cohen_kappa_score(y_test, predicted) >> >> return [kappa, accuracy] >> >> >> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >> >> Example output: >> --- >> log_run(df_save, y) >> Out[32]: [0.027777777777777728, 0.53333333333333333] >> >> log_run(df_save, y) >> Out[33]: [0.027777777777777728, 0.53333333333333333] >> >> log_run(df_save, y) >> Out[34]: [0.11347517730496456, 0.58333333333333337] >> >> log_run(df_save, y) >> Out[35]: [0.042553191489361743, 0.55000000000000004] >> >> log_run(df_save, y) >> Out[36]: [-0.07407407407407407, 0.51666666666666672] >> >> log_run(df_save, y) >> Out[37]: [0.042553191489361743, 0.55000000000000004] >> >> A little information on the problem DataFrame: >> --- >> len(df_save) >> Out[40]: 240 >> >> len(df_save.columns) >> Out[41]: 18 >> >> >> If I omit this particular column the Kappa no longer fluctuates: >> >> df_save[?abc'].head() >> Out[42]: >> 0 0.026316 >> 1 0.333333 >> 2 0.015152 >> 3 0.010526 >> 4 0.125000 >> Name: abc, dtype: float64 >> >> >> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >> >> >> Thanks! >> Chris >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Mon Aug 15 18:17:25 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 15 Aug 2016 18:17:25 -0400 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <903082E9-D944-4838-A882-982911520540@upnix.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> Message-ID: <31533467-ef95-6696-23cb-ac4ce5c1d9ff@gmail.com> Hm that looks kinda convoluted. Why don't you just do df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0) ? What version of scikit-learn are you using? Also, you are modifying the inputs. Can you try to do the same but pass a copy of the input dataframe to the method each time? On 08/15/2016 06:00 PM, Chris Cameron wrote: > Sebastian, > > That doesn?t do it. 
With the function: > > def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y > df_train, df_test = train_test_split(logreg_x, random_state=0) > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced', > tol=0.000000001, > random_state=0).fit(df_train, y_train) > predicted = log_reg_fit.predict(df_test) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > I?m still seeing: > log_run(df_save, y) > Out[7]: [-0.054421768707483005, 0.48333333333333334] > > log_run(df_save, y) > Out[8]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[9]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[10]: [0.027777777777777728, 0.53333333333333333] > > > Chris > >> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >> >> Hi, Chris, >> have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. >> >> Best, >> Sebastian >> >> >> >>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >>> >>> Hi all, >>> >>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >>> >>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. >>> >>> The code I?m using: >>> >>> def log_run(logreg_x, logreg_y): >>> logreg_x['pass_fail'] = logreg_y >>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >>> del(df_train['pass_fail']) >>> del(df_test['pass_fail']) >>> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >>> predicted = log_reg_fit.predict(df_test) >>> accuracy = accuracy_score(y_test, predicted) >>> kappa = cohen_kappa_score(y_test, predicted) >>> >>> return [kappa, accuracy] >>> >>> >>> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >>> >>> Example output: >>> --- >>> log_run(df_save, y) >>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>> >>> log_run(df_save, y) >>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>> >>> log_run(df_save, y) >>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>> >>> A little information on the problem DataFrame: >>> --- >>> len(df_save) >>> Out[40]: 240 >>> >>> len(df_save.columns) >>> Out[41]: 18 >>> >>> >>> If I omit this particular column the Kappa no longer fluctuates: >>> >>> df_save[?abc'].head() >>> Out[42]: >>> 0 0.026316 >>> 1 0.333333 >>> 2 0.015152 >>> 3 0.010526 >>> 4 0.125000 >>> Name: abc, dtype: float64 >>> >>> >>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >>> >>> >>> Thanks! 
>>> Chris >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at sebastianraschka.com Mon Aug 15 18:26:28 2016 From: mail at sebastianraschka.com (mail at sebastianraschka.com) Date: Mon, 15 Aug 2016 18:26:28 -0400 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <903082E9-D944-4838-A882-982911520540@upnix.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> Message-ID: hm, was worth a try. What happens if you change the solver to something else than liblinear, does this issue still persist? Btw. scikit-learn works with NumPy arrays, not NumPy matrices. Probably unrelated to your issue, I?d recommend setting > y_train = df_train.pass_fail.values > y_test = df_test.pass_fail.values instead of > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() Also, try passing NumPy arrays to the fit method: > log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train) and > predicted = log_reg_fit.predict(df_test.values) and so forth. > On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote: > > Sebastian, > > That doesn?t do it. With the function: > > def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y > df_train, df_test = train_test_split(logreg_x, random_state=0) > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced', > tol=0.000000001, > random_state=0).fit(df_train, y_train) > predicted = log_reg_fit.predict(df_test) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > I?m still seeing: > log_run(df_save, y) > Out[7]: [-0.054421768707483005, 0.48333333333333334] > > log_run(df_save, y) > Out[8]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[9]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[10]: [0.027777777777777728, 0.53333333333333333] > > > Chris > >> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >> >> Hi, Chris, >> have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. >> >> Best, >> Sebastian >> >> >> >>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >>> >>> Hi all, >>> >>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >>> >>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. 
>>> >>> The code I?m using: >>> >>> def log_run(logreg_x, logreg_y): >>> logreg_x['pass_fail'] = logreg_y >>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >>> del(df_train['pass_fail']) >>> del(df_test['pass_fail']) >>> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >>> predicted = log_reg_fit.predict(df_test) >>> accuracy = accuracy_score(y_test, predicted) >>> kappa = cohen_kappa_score(y_test, predicted) >>> >>> return [kappa, accuracy] >>> >>> >>> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >>> >>> Example output: >>> --- >>> log_run(df_save, y) >>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>> >>> log_run(df_save, y) >>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>> >>> log_run(df_save, y) >>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>> >>> A little information on the problem DataFrame: >>> --- >>> len(df_save) >>> Out[40]: 240 >>> >>> len(df_save.columns) >>> Out[41]: 18 >>> >>> >>> If I omit this particular column the Kappa no longer fluctuates: >>> >>> df_save[?abc'].head() >>> Out[42]: >>> 0 0.026316 >>> 1 0.333333 >>> 2 0.015152 >>> 3 0.010526 >>> 4 0.125000 >>> Name: abc, dtype: float64 >>> >>> >>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >>> >>> >>> Thanks! >>> Chris >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From chris at upnix.com Tue Aug 16 12:15:38 2016 From: chris at upnix.com (Chris Cameron) Date: Tue, 16 Aug 2016 10:15:38 -0600 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> Message-ID: <34155333-560B-4A8A-87AE-7D5D76C76807@upnix.com> Thank you everyone for your help. The short version of this email is that changing the solver from ?liblinear? to ?sag? fixed my problem - but only if I upped ?max_iter? to 1000. Longer version - Without max_iter=1000, I would get the warning: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge I have some columns in my data that have a huge range of values. Using ?liblinear?, if I transformed those columns, causing the range to be smaller, the results would be consistent every time. 
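A minimal sketch of that transformation idea, with the scaling folded into a Pipeline so it is learned on the training split only. The function name, the StandardScaler choice and the sklearn.model_selection import path are illustrative assumptions here, not the poster's actual code:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score

def log_run_scaled(logreg_x, logreg_y):
    # Same split as in the thread; logreg_x holds only the feature columns.
    df_train, df_test, y_train, y_test = train_test_split(
        logreg_x, logreg_y, random_state=0)
    # Standardizing the wide-range columns conditions the problem better,
    # so the solver converges to the same coefficients on every run.
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight='balanced', tol=1e-9, random_state=0))
    pipe.fit(df_train.values, y_train)
    predicted = pipe.predict(df_test.values)
    return [cohen_kappa_score(y_test, predicted),
            accuracy_score(y_test, predicted)]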
This is the function I ended up using - def log_run(logreg_x, logreg_y): logreg_x['pass_fail'] = logreg_y df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0) del(df_train['pass_fail']) del(df_test['pass_fail']) log_reg_fit = LogisticRegression(class_weight='balanced', tol=0.00000001, random_state=8, solver='sag', max_iter=1000).fit(df_train.values, y_train) predicted = log_reg_fit.predict(df_test.values) accuracy = accuracy_score(y_test, predicted) kappa = cohen_kappa_score(y_test, predicted) return [kappa, accuracy] Thank you again for the help, Chris > On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote: > > hm, was worth a try. What happens if you change the solver to something else than liblinear, does this issue still persist? > > > Btw. scikit-learn works with NumPy arrays, not NumPy matrices. Probably unrelated to your issue, I?d recommend setting > >> y_train = df_train.pass_fail.values >> y_test = df_test.pass_fail.values > > instead of > >> y_train = df_train.pass_fail.as_matrix() >> y_test = df_test.pass_fail.as_matrix() > > > Also, try passing NumPy arrays to the fit method: > >> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train) > > and > >> predicted = log_reg_fit.predict(df_test.values) > > and so forth. > > > > > >> On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote: >> >> Sebastian, >> >> That doesn?t do it. With the function: >> >> def log_run(logreg_x, logreg_y): >> logreg_x['pass_fail'] = logreg_y >> df_train, df_test = train_test_split(logreg_x, random_state=0) >> y_train = df_train.pass_fail.as_matrix() >> y_test = df_test.pass_fail.as_matrix() >> del(df_train['pass_fail']) >> del(df_test['pass_fail']) >> log_reg_fit = LogisticRegression(class_weight='balanced', >> tol=0.000000001, >> random_state=0).fit(df_train, y_train) >> predicted = log_reg_fit.predict(df_test) >> accuracy = accuracy_score(y_test, predicted) >> kappa = cohen_kappa_score(y_test, predicted) >> >> return [kappa, accuracy] >> >> I?m still seeing: >> log_run(df_save, y) >> Out[7]: [-0.054421768707483005, 0.48333333333333334] >> >> log_run(df_save, y) >> Out[8]: [0.042553191489361743, 0.55000000000000004] >> >> log_run(df_save, y) >> Out[9]: [0.042553191489361743, 0.55000000000000004] >> >> log_run(df_save, y) >> Out[10]: [0.027777777777777728, 0.53333333333333333] >> >> >> Chris >> >>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >>> >>> Hi, Chris, >>> have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. >>> >>> Best, >>> Sebastian >>> >>> >>> >>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >>>> >>>> Hi all, >>>> >>>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >>>> >>>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. 
>>>> >>>> The code I?m using: >>>> >>>> def log_run(logreg_x, logreg_y): >>>> logreg_x['pass_fail'] = logreg_y >>>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>>> y_train = df_train.pass_fail.as_matrix() >>>> y_test = df_test.pass_fail.as_matrix() >>>> del(df_train['pass_fail']) >>>> del(df_test['pass_fail']) >>>> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >>>> predicted = log_reg_fit.predict(df_test) >>>> accuracy = accuracy_score(y_test, predicted) >>>> kappa = cohen_kappa_score(y_test, predicted) >>>> >>>> return [kappa, accuracy] >>>> >>>> >>>> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >>>> >>>> Example output: >>>> --- >>>> log_run(df_save, y) >>>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>>> >>>> log_run(df_save, y) >>>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>>> >>>> log_run(df_save, y) >>>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>>> >>>> log_run(df_save, y) >>>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>>> >>>> log_run(df_save, y) >>>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>>> >>>> log_run(df_save, y) >>>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>>> >>>> A little information on the problem DataFrame: >>>> --- >>>> len(df_save) >>>> Out[40]: 240 >>>> >>>> len(df_save.columns) >>>> Out[41]: 18 >>>> >>>> >>>> If I omit this particular column the Kappa no longer fluctuates: >>>> >>>> df_save[?abc'].head() >>>> Out[42]: >>>> 0 0.026316 >>>> 1 0.333333 >>>> 2 0.015152 >>>> 3 0.010526 >>>> 4 0.125000 >>>> Name: abc, dtype: float64 >>>> >>>> >>>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >>>> >>>> >>>> Thanks! >>>> Chris >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Wed Aug 17 03:23:12 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 17 Aug 2016 09:23:12 +0200 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <34155333-560B-4A8A-87AE-7D5D76C76807@upnix.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> <34155333-560B-4A8A-87AE-7D5D76C76807@upnix.com> Message-ID: <892ed9d1-0aa3-4cfb-9454-889ddfac2e43@typeapp.com> In other words, you have an ill conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill conditionning. Not a bug. An expected behavior. Sent from my phone. Please forgive brevity and mis spelling On Aug 16, 2016, 18:17, at 18:17, Chris Cameron wrote: >Thank you everyone for your help. The short version of this email is >that changing the solver from ?liblinear? to ?sag? 
fixed my problem - >but only if I upped ?max_iter? to 1000. > > >Longer version - >Without max_iter=1000, I would get the warning: >ConvergenceWarning: The max_iter was reached which means the coef_ did >not converge > >I have some columns in my data that have a huge range of values. Using >?liblinear?, if I transformed those columns, causing the range to be >smaller, the results would be consistent every time. > >This is the function I ended up using - >def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y >df_train, df_test, y_train, y_test = train_test_split(logreg_x, >logreg_y, random_state=0) > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced', > tol=0.00000001, > random_state=8, > solver='sag', > max_iter=1000).fit(df_train.values, y_train) > predicted = log_reg_fit.predict(df_test.values) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > >Thank you again for the help, > >Chris > >> On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote: >> >> hm, was worth a try. What happens if you change the solver to >something else than liblinear, does this issue still persist? >> >> >> Btw. scikit-learn works with NumPy arrays, not NumPy matrices. >Probably unrelated to your issue, I?d recommend setting >> >>> y_train = df_train.pass_fail.values >>> y_test = df_test.pass_fail.values >> >> instead of >> >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >> >> >> Also, try passing NumPy arrays to the fit method: >> >>> log_reg_fit = LogisticRegression(...).fit(df_train.values, >y_train) >> >> and >> >>> predicted = log_reg_fit.predict(df_test.values) >> >> and so forth. >> >> >> >> >> >>> On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote: >>> >>> Sebastian, >>> >>> That doesn?t do it. With the function: >>> >>> def log_run(logreg_x, logreg_y): >>> logreg_x['pass_fail'] = logreg_y >>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >>> del(df_train['pass_fail']) >>> del(df_test['pass_fail']) >>> log_reg_fit = LogisticRegression(class_weight='balanced', >>> tol=0.000000001, >>> random_state=0).fit(df_train, >y_train) >>> predicted = log_reg_fit.predict(df_test) >>> accuracy = accuracy_score(y_test, predicted) >>> kappa = cohen_kappa_score(y_test, predicted) >>> >>> return [kappa, accuracy] >>> >>> I?m still seeing: >>> log_run(df_save, y) >>> Out[7]: [-0.054421768707483005, 0.48333333333333334] >>> >>> log_run(df_save, y) >>> Out[8]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[9]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[10]: [0.027777777777777728, 0.53333333333333333] >>> >>> >>> Chris >>> >>>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >>>> >>>> Hi, Chris, >>>> have you set the random seed to a specific, contant integer value? >Note that the default in LogisticRegression is random_state=None. >Setting it to some arbitrary number like 123 may help if you haven?t >done so, yet. >>>> >>>> Best, >>>> Sebastian >>>> >>>> >>>> >>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron >wrote: >>>>> >>>>> Hi all, >>>>> >>>>> Using the same X and y values >sklearn.linear_model.LogisticRegression.fit() is providing me with >inconsistent results. 
>>>>> >>>>> The documentation for sklearn.linear_model.LogisticRegression >states that "It is thus not uncommon, to have slightly different >results for the same input data.? I am experiencing this, however the >fix of using a smaller ?tol? parameter isn?t providing me with >consistent fit. >>>>> >>>>> The code I?m using: >>>>> >>>>> def log_run(logreg_x, logreg_y): >>>>> logreg_x['pass_fail'] = logreg_y >>>>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>>>> y_train = df_train.pass_fail.as_matrix() >>>>> y_test = df_test.pass_fail.as_matrix() >>>>> del(df_train['pass_fail']) >>>>> del(df_test['pass_fail']) >>>>> log_reg_fit = >LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, >y_train) >>>>> predicted = log_reg_fit.predict(df_test) >>>>> accuracy = accuracy_score(y_test, predicted) >>>>> kappa = cohen_kappa_score(y_test, predicted) >>>>> >>>>> return [kappa, accuracy] >>>>> >>>>> >>>>> I?ve gone out of my way to be sure the test and train data is the >same for each run, so I don?t think there should be random shuffling >going on. >>>>> >>>>> Example output: >>>>> --- >>>>> log_run(df_save, y) >>>>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>>>> >>>>> log_run(df_save, y) >>>>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>>>> >>>>> log_run(df_save, y) >>>>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>>>> >>>>> log_run(df_save, y) >>>>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>>>> >>>>> log_run(df_save, y) >>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>>>> >>>>> log_run(df_save, y) >>>>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>>>> >>>>> A little information on the problem DataFrame: >>>>> --- >>>>> len(df_save) >>>>> Out[40]: 240 >>>>> >>>>> len(df_save.columns) >>>>> Out[41]: 18 >>>>> >>>>> >>>>> If I omit this particular column the Kappa no longer fluctuates: >>>>> >>>>> df_save[?abc'].head() >>>>> Out[42]: >>>>> 0 0.026316 >>>>> 1 0.333333 >>>>> 2 0.015152 >>>>> 3 0.010526 >>>>> 4 0.125000 >>>>> Name: abc, dtype: float64 >>>>> >>>>> >>>>> Does anyone have ideas on how I can figure this out? Is there some >randomness/shuffling still going on I missed? >>>>> >>>>> >>>>> Thanks! >>>>> Chris >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Wed Aug 17 10:43:25 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 17 Aug 2016 16:43:25 +0200 Subject: [scikit-learn] 0.18? 
In-Reply-To: References: <577D4D41.60102@gmail.com> Message-ID: Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn master and all wheels builds are green with the multibuild setup: https://travis-ci.org/MacPython/scikit-learn-wheels Matthew: would you be interested in having the multibuild repo extended to also include appveyor configration files or do you think it's better to let projects owner do their own appveyor config by themselves? -- Olivier From matthew.brett at gmail.com Wed Aug 17 12:44:20 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 17 Aug 2016 09:44:20 -0700 Subject: [scikit-learn] 0.18? In-Reply-To: References: <577D4D41.60102@gmail.com> Message-ID: Definitely interested ! On 17 Aug 2016 07:44, "Olivier Grisel" wrote: > Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn > master and all wheels builds are green with the multibuild setup: > > https://travis-ci.org/MacPython/scikit-learn-wheels > > Matthew: would you be interested in having the multibuild repo > extended to also include appveyor configration files or do you think > it's better to let projects owner do their own appveyor config by > themselves? > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adamt at nih.gov Wed Aug 17 14:53:29 2016 From: adamt at nih.gov (Thomas, Adam (NIH/NIMH) [E]) Date: Wed, 17 Aug 2016 18:53:29 +0000 Subject: [scikit-learn] Hiring: Cloud and HPC Engineer at NIMH in Bethesda, MD Message-ID: HIRING: CLOUD AND HPC ENGINEER The National Institute of Mental Health (NIMH) is the lead federal agency for research on mental disorders. NIMH is one of the 27 Institutes and Centers that make up the National Institutes of Health (NIH), which is responsible for all federally funded biomedical research in US. NIH is part of the U.S. Department of Health and Human Services (HHS). The NIH is a highly rated employer at glassdoor.com with very competitive salary and benefits packages. The Data Science and Sharing Team (DSST) is a new group created to develop and support data sharing and other data-intensive scientific projects within the NIMH Intramural Research Program (IRP). Working closely with the Office of Data Science the goal of the DSST is to make the NIMH IRP a leader in the open science and data sharing practices mandated by the Open Data Policy released by the White House on 9 May, 2013. We are building a team to make that happen. What you?ll do? --------------- BUILD You will work with a team of researchers and developers to build and deploy neuroimaging data processing pipelines for investigators within the NIMH IRP. You will collaborate with and contribute to other projects throughout the world that are building standards and tools for open and reproducible neuroscience (e.g., NiPy, BIDS, Binder, Rstudio). You'll have the resources of the NIH HPC Cluster at your disposal as well as additional help from the AWS cloud. All tools and code will be open source and freely distributed. TEACH You will work to bolster data science skills within the NIMH IRP by teaching courses to scientists on best data practices (e.g. Software & Data Carpentry) as well as accessing and using specific neuroimaging repositories (e.g. The Human Connectome Project, OpenfMRI, UK Biobank). 
QUANTIFY There is no use building tools for open science if no one uses them. Part of the job of the DSST is to measure data sharing and open science practices within the NIMH IRP and progress toward their adoption. This will include bibliometrics for scientific publications from the NIMH IRP and other measures of data sharing and secondary data utilization. You will provide crucial systems level support to the team in gauging this progress. Who you are? ------------ EXPERIENCED You should be very comfortable on the command line and have a rock-solid handle on one or more Unix-based operating systems. You should have some experience with distributed, high-performance computing tools such as Spark, OpenStack, Docker/Singularity, and batch processing systems such as SLURM and SGE. You should also have experience coding in modern languages currently used in data-intensive, scientific computing such as Python, R, and Javascript, as well as interfacing with a variety of APIs. PROVEN Ideally we would like to see a recent degree (BS, MS, or PhD) in a STEM field, but if you can prove you have an equivalent amount of expertise with your publications, projects, or github/kaggle ranking, we?re all ears. We are also interviewing students and part-time staff if you?re still working on your degree. DRIVEN Data science is moving fast ? we?re looking for someone who can move faster. You should be a self-learner and a self-starter. Provide some examples of things you have worked on independently. How to apply? ------------- Email your resume, a cover letter, and a code sample that demonstrates you are all three of the above to: DATASCI-JOBSEARCH at mail.nih.gov The National Institutes of Health is an equal opportunity employer. From t3kcit at gmail.com Wed Aug 17 16:37:17 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 17 Aug 2016 16:37:17 -0400 Subject: [scikit-learn] 0.18? In-Reply-To: <69c5d973-f798-f5a1-a4a3-1ee43e2c1a36@gmail.com> References: <577D4D41.60102@gmail.com> <69c5d973-f798-f5a1-a4a3-1ee43e2c1a36@gmail.com> Message-ID: I vote we push back the release a bit (two weeks?). I had unexpected things come up that ate a bunch of my time. On 07/25/2016 03:53 PM, Andreas Mueller wrote: > Hi Olivier / all > Let me know if I can help with the builds. > I'm gonna start reviews and triaging and tagging this week. > Mid August sounds good for a beta / RC. > > It would be great if we could release in September, as that is when > The Book (aka my past year) > is scheduled to come out (I finished it last week). The Book uses > model_selection, so having > the release out before the book would be good. > > Andy > > On 07/25/2016 07:54 AM, Olivier Grisel wrote: >> Sorry for the late reply, >> >> Before working on this release I would like to automate the wheel >> generation process (for the release wheels) in a single repo that will >> generate wheels for linux, osx and windows based on >> https://github.com/matthew-brett/multibuild >> >> I plan to put that repo under >> https://github.com/scikit-learn/scikit-learn-wheels and deprecate >> https://github.com/MacPython/scikit-learn-wheels that we used for the >> OSX wheels. >> >> There is also some issue triaging to do, it would be great to identify >> blocker bugs that we would like to get fixed before releasing 0.18. >> >> We can aim to do a beta mid-August and the final release after >> euroscipy (first week of September). 
>> > From bertrand.thirion at inria.fr Wed Aug 17 16:57:00 2016 From: bertrand.thirion at inria.fr (bthirion) Date: Wed, 17 Aug 2016 22:57:00 +0200 Subject: [scikit-learn] 0.18? In-Reply-To: References: <577D4D41.60102@gmail.com> Message-ID: Many thanks ! Bertrand On 17/08/2016 16:43, Olivier Grisel wrote: > Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn > master and all wheels builds are green with the multibuild setup: > > https://travis-ci.org/MacPython/scikit-learn-wheels > > Matthew: would you be interested in having the multibuild repo > extended to also include appveyor configration files or do you think > it's better to let projects owner do their own appveyor config by > themselves? > From mail at sebastianraschka.com Thu Aug 18 11:44:42 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 18 Aug 2016 11:44:42 -0400 Subject: [scikit-learn] update pydata schedule Message-ID: <4EC4AB2F-EA4B-4794-AECE-97FAB72C79AA@sebastianraschka.com> From mail at sebastianraschka.com Thu Aug 18 12:51:54 2016 From: mail at sebastianraschka.com (mail at sebastianraschka.com) Date: Thu, 18 Aug 2016 12:51:54 -0400 Subject: [scikit-learn] update pydata schedule In-Reply-To: <4EC4AB2F-EA4B-4794-AECE-97FAB72C79AA@sebastianraschka.com> References: <4EC4AB2F-EA4B-4794-AECE-97FAB72C79AA@sebastianraschka.com> Message-ID: Sorry for this previous Email, please disregard. This was a reminder to myself and I somehow sent it to the wrong recipient. Sent from my iPhone > On Aug 18, 2016, at 11:44 AM, Sebastian Raschka wrote: > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From shanglunwang at gmail.com Fri Aug 19 19:59:06 2016 From: shanglunwang at gmail.com (Shanglun Wang) Date: Fri, 19 Aug 2016 19:59:06 -0400 Subject: [scikit-learn] Help with improving t-sne Message-ID: Hello, I am currently working on a ticket on github involving improving the data structures powering t-sne. I am running into some trouble trying to conceptually link up what the code is doing and the underlying mathematical theory. Normally I would just grapple with it, but I feel like I would need some help to get this ticket done in a reasonable time frame. Would someone be willing to help me understand the theory underpinning t-sne, and how that links up with the implementation? Thank you, Sean -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brookm291 at gmail.com Sun Aug 21 05:33:35 2016 From: brookm291 at gmail.com (KevNo) Date: Sun, 21 Aug 2016 18:33:35 +0900 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits Message-ID: <57B9756F.40100@gmail.com> Hi, I follow the instructions here: to compile Scikit-learn for development https://github.com/scikit-learn/scikit-learn/issues/5709 1) Download from Git into a repository scikit_learn 2) Add the path to Python Path 3) In IPython !!python D:\_devs\Python01\scikit_learn\sklearn\setup.py build_ext --inplace and I got this message: |Assuming default configuration (scikit_learn\\sklearn\\svm\\tests/{setup_tests,setup}.py was not found)D:\\\\_devs\\\\Python01\\\\scikit_learn\\\\sklearn\\\\setup.py:71: UserWarning: ', ' Blas (http://www.netlib.org/blas/) libraries not found.', ' Directories to search for the libraries can be specified in the', ' numpy/distutils/site.cfg file (section [blas]) or by setting', ' the BLAS environment variable.', ' warnings.warn(BlasNotFoundError.__doc__)', 'Warning: Assuming default configuration (scikit_learn\\sklearn\\linear_model\\tests/{setup_tests,setup}.py was not found)Warning: Assuming default configuration (scikit_learn\\sklearn\\utils\\sparsetools\\tests/{setup_tests,setup}.py was not found)Warning: Assuming default configuration (scikit_learn\\sklearn\\utils\\tests/{setup_tests,setup}.py was not found)Warning: Assuming default configuration (scikit_learn\\sklearn\\tests/{setup_tests,setup}.py was not found)gcc.exe: error: _check_build.c: No such file or directory', 'gcc.exe: fatal error: no input files', 'compilation terminated.', 'error: Command "gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes -D__MSVCRT_VERSION__=0x0900 -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\lib\\site-packages\\numpy\\core\\include -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\lib\\site-packages\\numpy\\core\\include -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\include -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\PC -c _check_build.c -o build\\temp.win-amd64-2.7\\Release\\_check_build.o" failed with exit status 1']| Environnment WinPython 2.7 64bits Windows-7-6.1.7601-SP1 ('Python', '2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)]') ('NumPy', '1.9.2') ('SciPy', '0.16.0' ) Just wondering if possible to get a place where we can compile without spending hours and days on the web to find the issues ? Thanks Brook -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Aug 22 05:43:21 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 22 Aug 2016 11:43:21 +0200 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits In-Reply-To: <57B9756F.40100@gmail.com> References: <57B9756F.40100@gmail.com> Message-ID: The error message mentions gcc. Have you installed some mingw version? As of now our windows build is only properly tested with the Visual Studio C++ compiler from appveyor: https://ci.appveyor.com/project/sklearn-ci/scikit-learn I have not tested the build with mingwpy in a while (I am not a windows user my-self). 
The file not found error makes me think that you might need to cd into the scikit-learn source folder: !!cd D:\_devs\Python01\scikit_learn\sklearn !!python setup.py build_ext --inplace -- Olivier From joel.nothman at gmail.com Mon Aug 22 06:22:23 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 22 Aug 2016 20:22:23 +1000 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits In-Reply-To: References: <57B9756F.40100@gmail.com> Message-ID: You could also use ! pip install D:\_devs\Python01\scikit_learn\sklearn or indeed ! pip install git+https://github.com/scikit-learn/scikit-learn/ if you don't actually want to use the directory with the source code in it. On 22 August 2016 at 19:43, Olivier Grisel wrote: > The error message mentions gcc. Have you installed some mingw version? > > As of now our windows build is only properly tested with the Visual > Studio C++ compiler from appveyor: > > https://ci.appveyor.com/project/sklearn-ci/scikit-learn > > I have not tested the build with mingwpy in a while (I am not a > windows user my-self). > > The file not found error makes me think that you might need to cd into > the scikit-learn source folder: > > !!cd D:\_devs\Python01\scikit_learn\sklearn > !!python setup.py build_ext --inplace > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Aug 22 09:33:07 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 22 Aug 2016 15:33:07 +0200 Subject: [scikit-learn] 0.18? In-Reply-To: References: <577D4D41.60102@gmail.com> <69c5d973-f798-f5a1-a4a3-1ee43e2c1a36@gmail.com> Message-ID: Ok for pushing back. Let's try to work on the beta on the week after euroscipy if we can. At least all the annoying binary packaging issues are fixed (test failures for the linux and OSX 32 bit platforms) so the release process itself should hopefully be painless. -- Olivier From t3kcit at gmail.com Mon Aug 22 14:30:51 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 22 Aug 2016 14:30:51 -0400 Subject: [scikit-learn] Help with improving t-sne In-Reply-To: References: Message-ID: <9cfda285-e073-851f-f4dd-963a76b62a9b@gmail.com> Hi Sean. Thanks for working on this. Do you have any more specific questions? Have you looked at the barnes-hut paper? Cheers, Andy On 08/19/2016 07:59 PM, Shanglun Wang wrote: > > Hello, > > I am currently working on a ticket on github involving improving the > data structures powering t-sne. I am running into some trouble trying > to conceptually link up what the code is doing and the underlying > mathematical theory. Normally I would just grapple with it, but I feel > like I would need some help to get this ticket done in a reasonable > time frame. > > Would someone be willing to help me understand the theory underpinning > t-sne, and how that links up with the implementation? > > Thank you, > > Sean > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brookm291 at gmail.com Tue Aug 23 13:23:34 2016 From: brookm291 at gmail.com (KevNo) Date: Wed, 24 Aug 2016 02:23:34 +0900 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits In-Reply-To: References: Message-ID: <57BC8696.3080206@gmail.com> Hello, Thanks for yoru advice/reply. I tried to build from VC after following Instructions of VS Python 2.7 compiler. Steps ares: |1) Git download 2) VC++ for Python: https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/#comment-515 3) Change thePath for compiler VStudio 4) in|!!cd D:\_devs\Python01\scikit_learn\sklearn (folder of sklearn) |5) python setup.py build | I have this message: |building'sklearn.__check_build._check_build' extension compiling C sources cl.exe/c/nologo/Ox /MD/W3/GS- /DNDEBUG-ID:\_devs\Python01\Anaconda2\lib\si te-packages\numpy\core\include-ID:\_devs\Python01\Anaconda2\lib\site-packages\n umpy\core\include-ID:\_devs\Python01\Anaconda2\include-ID:\_devs\Python01\Anac onda2\PC/Tc_check_build.c/Fobuild\temp.win-amd64-2.7\Release\_check_build.obj Found executable C:\Users\asus1\AppData\Local\Programs\Common\Microsoft\Visual C ++ for Python\9.0\VC\Bin\amd64\cl.exe _check_build.c c1: fatal error C1083: Cannot open source file: '_check_build.c': No such file or directory| I dont have any idea where it could come from Thanks Brook > scikit-learn-request at python.org > Tuesday, August 23, 2016 1:00 AM > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: Building Scikit Learn in Win 7 64bits (Olivier Grisel) > 2. Re: Building Scikit Learn in Win 7 64bits (Joel Nothman) > 3. Re: 0.18? (Olivier Grisel) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 22 Aug 2016 11:43:21 +0200 > From: Olivier Grisel > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Building Scikit Learn in Win 7 64bits > Message-ID: > > Content-Type: text/plain; charset=UTF-8 > > The error message mentions gcc. Have you installed some mingw version? > > As of now our windows build is only properly tested with the Visual > Studio C++ compiler from appveyor: > > https://ci.appveyor.com/project/sklearn-ci/scikit-learn > > I have not tested the build with mingwpy in a while (I am not a > windows user my-self). > > The file not found error makes me think that you might need to cd into > the scikit-learn source folder: > > !!cd D:\_devs\Python01\scikit_learn\sklearn > !!python setup.py build_ext --inplace > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: compose-unknown-contact.jpg Type: image/jpeg Size: 770 bytes Desc: not available URL: From aadral at gmail.com Tue Aug 23 16:24:30 2016 From: aadral at gmail.com (=?UTF-8?B?0JDQu9C10LrRgdC10Lkg0JTRgNCw0LvRjA==?=) Date: Tue, 23 Aug 2016 21:24:30 +0100 Subject: [scikit-learn] GradientBoostingRegressor, question about initialisation with MeanEstimator Message-ID: Hi there, I recently found out that GradientBoostingRegressor uses MeanEstimator for the initial estimator in ensemble. Could you please point out (or explain) to the research showing superiority of this approach compared to the usage of DecisionTreeRegressor? -- Yours sincerely, Alexey A. Dral -------------- next part -------------- An HTML attachment was scrubbed... URL: From siddhantloya2008 at gmail.com Fri Aug 26 03:08:55 2016 From: siddhantloya2008 at gmail.com (Siddhant Loya) Date: Fri, 26 Aug 2016 12:38:55 +0530 Subject: [scikit-learn] Fitting a plane to a 3D points Cloud Message-ID: I have been trying to use Ransac to fit a plane to a 3D point cloud. I am not able to understand on how to do this on 3D data. I have already posted a question on SO. Link :- http://stackoverflow.com/questions/39159102/fit-a-plane-to-3d-point-cloud-using-ransac?noredirect=1#comment65663410_39159102 I am not able to understand how to solve this for a plane instead of 2-D line. Regards, Siddhant -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Fri Aug 26 10:09:15 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 26 Aug 2016 16:09:15 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD Message-ID: <57C04D8B.40001@gmail.com> Hi all, I have a question about using the TruncatedSVD method for performing Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply applying TruncatedSVD to a tf-idf matrice is sufficient (cf. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), but I'm wondering about that. As far as I understood for LSA one computes a truncated SVD decomposition of the tf-idf matrix X (n_features x n_samples), X ? U @ Sigma @ V.T and then for a document vector d, the projection is computed as, d_proj = d.T @ U @ Sigma?? (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf) However, TruncatedSVD.fit_transform only computes, d_proj = d.T @ U and what's more does not store the singular values (Sigma) internally, so it cannot be easily applied afterwards. (the above notation are transposed with respect to those in the scikit learn docs). For instance, I have tried reproducing LSA decomposition from literature and I'm not getting the expected results unless I perform an additional normalization by the Sigma matrix: https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d I was wondering if I am missing something here? Thank you, -- Roman From t3kcit at gmail.com Fri Aug 26 10:55:41 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 26 Aug 2016 10:55:41 -0400 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: <57C04D8B.40001@gmail.com> References: <57C04D8B.40001@gmail.com> Message-ID: <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Looks like they apply whitening, which is not implemented in TruncatedSVD. I guess we could add that option. It's equivalent to using a StandardScaler after the TruncatedSVD. Can you try and see if that reproduces the results? 
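A minimal sketch of the Sigma normalization being discussed, on a toy corpus (the corpus and variable names are illustrative only). Since fit_transform returns approximately U @ Sigma and the columns of U are unit-norm, the singular values can be recovered as column norms and divided out, which gives the Sigma^-1-scaled projection from the IR-book formulation:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["human machine interface", "graph of trees",
        "user interface system", "trees graph minors"]

X = TfidfVectorizer().fit_transform(docs)   # n_samples x n_features
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)                # approximately U @ Sigma

# Columns of U are (approximately) unit-norm, so the singular values can
# be read off as the column norms of the transformed matrix ...
sigma = np.linalg.norm(X_lsa, axis=0)
# ... and dividing them out gives the Sigma^-1-normalized projection.
X_lsa_normalized = X_lsa / sigma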
On 08/26/2016 10:09 AM, Roman Yurchak wrote: > Hi all, > > I have a question about using the TruncatedSVD method for performing > Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply > applying TruncatedSVD to a tf-idf matrice is sufficient (cf. > http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), > but I'm wondering about that. > > As far as I understood for LSA one computes a truncated SVD > decomposition of the tf-idf matrix X (n_features x n_samples), > X ? U @ Sigma @ V.T > and then for a document vector d, the projection is computed as, > d_proj = d.T @ U @ Sigma?? > (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf) > However, TruncatedSVD.fit_transform only computes, > d_proj = d.T @ U > and what's more does not store the singular values (Sigma) internally, > so it cannot be easily applied afterwards. > (the above notation are transposed with respect to those in the scikit > learn docs). > > For instance, I have tried reproducing LSA decomposition from literature > and I'm not getting the expected results unless I perform an additional > normalization by the Sigma matrix: > https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d > > I was wondering if I am missing something here? > Thank you, From elgesto at gmail.com Sat Aug 27 05:33:19 2016 From: elgesto at gmail.com (elgesto at gmail.com) Date: Sat, 27 Aug 2016 12:33:19 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: I have a project that is based on the SVM algorithm implemented by libsvm . Recently I decided to try several other classification algorithms, which is where scikit-learn comes into the picture. The connection to scikit-learn was pretty straightforward: it supports the libsvm format via the load_svmlight_file routine, and its SVM implementation is based on the same libsvm. When everything was done, I decided to check the consistency of the results by directly running libsvm and via scikit-learn, and the results were different. Among 18 measures in the learning curves, 7 were different, and the differences are located at the small steps of the learning curve. The libsvm results seem much more stable, but the scikit-learn results show some drastic fluctuation. The classifiers have exactly the same parameters, of course. I tried to check the version of libsvm used in the scikit-learn implementation, but I didn't find it; the only thing I found was the libsvm.so file. Currently I am using libsvm version 3.21 and scikit-learn version 0.17.1. I would appreciate any help in addressing this issue.
size  libsvm               scikit-learn
1     0.1336239435355727   0.1336239435355727
2     0.08699516468193455  0.08699516468193455
3     0.32928301642777424  0.2117238289550198   #different
4     0.2835688734876902   0.2835688734876902
5     0.27846766962743097  0.26651875338163966  #different
6     0.2853854654662907   0.18898048915599963  #different
7     0.28196058132165136  0.28196058132165136
8     0.31473956032575623  0.1958710201604552   #different
9     0.33588303670653136  0.2101641630182972   #different
10    0.4075242509025311   0.2997807499800962   #different
15    0.4391771087975972   0.4391771087975972
20    0.3837789445609818   0.2713167833345173   #different
25    0.4252154334940311   0.4252154334940311
30    0.4256407777477492   0.4256407777477492
35    0.45314944605858387  0.45314944605858387
40    0.4278633233755064   0.4278633233755064
45    0.46174762022239796  0.46174762022239796
50    0.45370452524846866  0.45370452524846866
-------------- next part -------------- An HTML attachment was scrubbed...
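One way to rule out a silent mismatch in hyper-parameters between the two runs is to pin the SVC arguments explicitly to the defaults used by the svm-train command line; a minimal sketch under that assumption (the file name is a placeholder, and this does not address differences between the libsvm copy bundled with scikit-learn and a newer standalone build):

from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

X, y = load_svmlight_file("train.libsvm")   # placeholder path

# Mirror the svm-train defaults: C-SVC, RBF kernel, gamma = 1/n_features,
# stopping tolerance 1e-3, shrinking enabled, 100 MB kernel cache.
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / X.shape[1],
          tol=1e-3, shrinking=True, cache_size=100)
clf.fit(X, y)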
URL: From olologin at gmail.com Sat Aug 27 05:49:26 2016 From: olologin at gmail.com (olologin) Date: Sat, 27 Aug 2016 12:49:26 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: > > I have a project that is based on SVM algorithm implemented by libsvm > . Recently I decided to > try several other classification algorithm, this is where scikit-learn > comes to the picture. > > The connection to the scikit was pretty straightforward, it supports > libsvm format by |load_svmlight_file| routine. Ans it's svm > implementation is based on the same libsvm. > > When everything was done, I decided to the check the consistence of > the results by directly running libsvm and via scikit-learn, and the > results were different. Among 18 measures in learning curves, 7 were > different, and the difference is located at the small steps of the > learning curve. The libsvm results seems much more stable, but > scikit-learn results have some drastic fluctuation. > > The classifiers have exactly the same parameters of course. I tried to > check the version of libsvm in scikit-learn implementation, but I > din't find it, the only thing I found was libsvm.so file. > > Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 version. > > I wound appreciate any help in addressing this issue. > > > |size libsvm scikit-learn 1 0.1336239435355727 0.1336239435355727 2 > 0.08699516468193455 0.08699516468193455 3 0.32928301642777424 > 0.2117238289550198 #different 4 0.2835688734876902 0.2835688734876902 > 5 0.27846766962743097 0.26651875338163966 #different 6 > 0.2853854654662907 0.18898048915599963 #different 7 > 0.28196058132165136 0.28196058132165136 8 0.31473956032575623 > 0.1958710201604552 #different 9 0.33588303670653136 0.2101641630182972 > #different 10 0.4075242509025311 0.2997807499800962 #different 15 > 0.4391771087975972 0.4391771087975972 20 0.3837789445609818 > 0.2713167833345173 #different 25 0.4252154334940311 0.4252154334940311 > 30 0.4256407777477492 0.4256407777477492 35 0.45314944605858387 > 0.45314944605858387 40 0.4278633233755064 0.4278633233755064 45 > 0.46174762022239796 0.46174762022239796 50 0.45370452524846866 > 0.45370452524846866| > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn This might be because current version of libsvm used in scikit is 3.10 from 2011. With some patch imported from upstream. -------------- next part -------------- An HTML attachment was scrubbed... URL: From elgesto at gmail.com Sat Aug 27 07:19:28 2016 From: elgesto at gmail.com (elgesto at gmail.com) Date: Sat, 27 Aug 2016 14:19:28 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: Can I update the libsvm version by myself? 2016-08-27 12:49 GMT+03:00 olologin : > On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: > > I have a project that is based on SVM algorithm implemented by libsvm > . Recently I decided to try > several other classification algorithm, this is where scikit-learn > comes to the picture. > > The connection to the scikit was pretty straightforward, it supports > libsvm format by load_svmlight_file routine. Ans it's svm implementation > is based on the same libsvm. 
> > When everything was done, I decided to the check the consistence of the > results by directly running libsvm and via scikit-learn, and the results > were different. Among 18 measures in learning curves, 7 were different, and > the difference is located at the small steps of the learning curve. The > libsvm results seems much more stable, but scikit-learn results have some > drastic fluctuation. > > The classifiers have exactly the same parameters of course. I tried to > check the version of libsvm in scikit-learn implementation, but I din't > find it, the only thing I found was libsvm.so file. > > Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 version. > > I wound appreciate any help in addressing this issue. > > > size libsvm scikit-learn > 1 0.1336239435355727 0.1336239435355727 > 2 0.08699516468193455 0.08699516468193455 > 3 0.32928301642777424 0.2117238289550198 #different > 4 0.2835688734876902 0.2835688734876902 > 5 0.27846766962743097 0.26651875338163966 #different > 6 0.2853854654662907 0.18898048915599963 #different > 7 0.28196058132165136 0.28196058132165136 > 8 0.31473956032575623 0.1958710201604552 #different > 9 0.33588303670653136 0.2101641630182972 #different > 10 0.4075242509025311 0.2997807499800962 #different > 15 0.4391771087975972 0.4391771087975972 > 20 0.3837789445609818 0.2713167833345173 #different > 25 0.4252154334940311 0.4252154334940311 > 30 0.4256407777477492 0.4256407777477492 > 35 0.45314944605858387 0.45314944605858387 > 40 0.4278633233755064 0.4278633233755064 > 45 0.46174762022239796 0.46174762022239796 > 50 0.45370452524846866 0.45370452524846866 > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > This might be because current version of libsvm used in scikit is 3.10 > from 2011. With some patch imported from upstream. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olologin at gmail.com Sat Aug 27 08:36:56 2016 From: olologin at gmail.com (olologin) Date: Sat, 27 Aug 2016 15:36:56 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> On 08/27/2016 02:19 PM, elgesto at gmail.com wrote: > Can I update the libsvm version by myself? > > 2016-08-27 12:49 GMT+03:00 olologin >: > > On 08/27/2016 12:33 PM, elgesto at gmail.com > wrote: >> >> I have a project that is based on SVM algorithm implemented by >> libsvm . Recently I >> decided to try several other classification algorithm, this is >> where scikit-learn comes to the picture. >> >> The connection to the scikit was pretty straightforward, it >> supports libsvm format by |load_svmlight_file| routine. Ans it's >> svm implementation is based on the same libsvm. >> >> When everything was done, I decided to the check the consistence >> of the results by directly running libsvm and via scikit-learn, >> and the results were different. Among 18 measures in learning >> curves, 7 were different, and the difference is located at the >> small steps of the learning curve. The libsvm results seems much >> more stable, but scikit-learn results have some drastic fluctuation. >> >> The classifiers have exactly the same parameters of course. 
I >> tried to check the version of libsvm in scikit-learn >> implementation, but I din't find it, the only thing I found was >> libsvm.so file. >> >> Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 >> version. >> >> I wound appreciate any help in addressing this issue. >> >> >> |size libsvm scikit-learn 1 0.1336239435355727 0.1336239435355727 >> 2 0.08699516468193455 0.08699516468193455 3 0.32928301642777424 >> 0.2117238289550198 #different 4 0.2835688734876902 >> 0.2835688734876902 5 0.27846766962743097 0.26651875338163966 >> #different 6 0.2853854654662907 0.18898048915599963 #different 7 >> 0.28196058132165136 0.28196058132165136 8 0.31473956032575623 >> 0.1958710201604552 #different 9 0.33588303670653136 >> 0.2101641630182972 #different 10 0.4075242509025311 >> 0.2997807499800962 #different 15 0.4391771087975972 >> 0.4391771087975972 20 0.3837789445609818 0.2713167833345173 >> #different 25 0.4252154334940311 0.4252154334940311 30 >> 0.4256407777477492 0.4256407777477492 35 0.45314944605858387 >> 0.45314944605858387 40 0.4278633233755064 0.4278633233755064 45 >> 0.46174762022239796 0.46174762022239796 50 0.45370452524846866 >> 0.45370452524846866| >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > This might be because current version of libsvm used in scikit is > 3.10 from 2011. With some patch imported from upstream. > > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn I don't think it is so easy, version which is used in scikit-learn has many additional modifications. from header of svm.cpp: /* Modified 2010: - Support for dense data by Ming-Fang Weng - Return indices for support vectors, Fabian Pedregosa - Fixes to avoid name collision, Fabian Pedregosa - Add support for instance weights, Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, . - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ -------------- next part -------------- An HTML attachment was scrubbed... URL: From elgesto at gmail.com Sat Aug 27 09:42:20 2016 From: elgesto at gmail.com (elgesto at gmail.com) Date: Sat, 27 Aug 2016 16:42:20 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: So there is no possibility to reach a consistency? 2016-08-27 15:36 GMT+03:00 olologin : > On 08/27/2016 02:19 PM, elgesto at gmail.com wrote: > > Can I update the libsvm version by myself? > > 2016-08-27 12:49 GMT+03:00 olologin : > >> On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: >> >> I have a project that is based on SVM algorithm implemented by libsvm >> . Recently I decided to >> try several other classification algorithm, this is where scikit-learn >> comes to the picture. >> >> The connection to the scikit was pretty straightforward, it supports >> libsvm format by load_svmlight_file routine. Ans it's svm implementation >> is based on the same libsvm. 
>> >> When everything was done, I decided to the check the consistence of the >> results by directly running libsvm and via scikit-learn, and the results >> were different. Among 18 measures in learning curves, 7 were different, and >> the difference is located at the small steps of the learning curve. The >> libsvm results seems much more stable, but scikit-learn results have some >> drastic fluctuation. >> >> The classifiers have exactly the same parameters of course. I tried to >> check the version of libsvm in scikit-learn implementation, but I din't >> find it, the only thing I found was libsvm.so file. >> >> Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 version. >> >> I wound appreciate any help in addressing this issue. >> >> >> size libsvm scikit-learn >> 1 0.1336239435355727 0.1336239435355727 >> 2 0.08699516468193455 0.08699516468193455 >> 3 0.32928301642777424 0.2117238289550198 #different >> 4 0.2835688734876902 0.2835688734876902 >> 5 0.27846766962743097 0.26651875338163966 #different >> 6 0.2853854654662907 0.18898048915599963 #different >> 7 0.28196058132165136 0.28196058132165136 >> 8 0.31473956032575623 0.1958710201604552 #different >> 9 0.33588303670653136 0.2101641630182972 #different >> 10 0.4075242509025311 0.2997807499800962 #different >> 15 0.4391771087975972 0.4391771087975972 >> 20 0.3837789445609818 0.2713167833345173 #different >> 25 0.4252154334940311 0.4252154334940311 >> 30 0.4256407777477492 0.4256407777477492 >> 35 0.45314944605858387 0.45314944605858387 >> 40 0.4278633233755064 0.4278633233755064 >> 45 0.46174762022239796 0.46174762022239796 >> 50 0.45370452524846866 0.45370452524846866 >> >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> This might be because current version of libsvm used in scikit is 3.10 >> from 2011. With some patch imported from upstream. >> _______________________________________________ scikit-learn mailing >> list scikit-learn at python.org https://mail.python.org/mailma >> n/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > I don't think it is so easy, version which is used in scikit-learn has > many additional modifications. > > from header of svm.cpp: /* Modified 2010: - Support for dense data > by Ming-Fang Weng - Return indices for support vectors, Fabian Pedregosa > - Fixes > to avoid name collision, Fabian Pedregosa - Add support for instance > weights, Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien > Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, > for_data_instances> > . > - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Aug 27 09:48:17 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 27 Aug 2016 23:48:17 +1000 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: I don't think we should assume that this is the only possible reason for inconsistency. 
Could you give us a small snippet of data and code on which you find this inconsistency? On 27 August 2016 at 23:42, elgesto at gmail.com wrote: > So there is no possibility to reach a consistency? > > 2016-08-27 15:36 GMT+03:00 olologin : > >> On 08/27/2016 02:19 PM, elgesto at gmail.com wrote: >> >> Can I update the libsvm version by myself? >> >> 2016-08-27 12:49 GMT+03:00 olologin : >> >>> On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: >>> >>> I have a project that is based on SVM algorithm implemented by libsvm >>> . Recently I decided to >>> try several other classification algorithm, this is where scikit-learn >>> comes to the picture. >>> >>> The connection to the scikit was pretty straightforward, it supports >>> libsvm format by load_svmlight_file routine. Ans it's svm >>> implementation is based on the same libsvm. >>> >>> When everything was done, I decided to the check the consistence of the >>> results by directly running libsvm and via scikit-learn, and the results >>> were different. Among 18 measures in learning curves, 7 were different, and >>> the difference is located at the small steps of the learning curve. The >>> libsvm results seems much more stable, but scikit-learn results have some >>> drastic fluctuation. >>> >>> The classifiers have exactly the same parameters of course. I tried to >>> check the version of libsvm in scikit-learn implementation, but I din't >>> find it, the only thing I found was libsvm.so file. >>> >>> Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 >>> version. >>> >>> I wound appreciate any help in addressing this issue. >>> >>> >>> size libsvm scikit-learn >>> 1 0.1336239435355727 0.1336239435355727 >>> 2 0.08699516468193455 0.08699516468193455 >>> 3 0.32928301642777424 0.2117238289550198 #different >>> 4 0.2835688734876902 0.2835688734876902 >>> 5 0.27846766962743097 0.26651875338163966 #different >>> 6 0.2853854654662907 0.18898048915599963 #different >>> 7 0.28196058132165136 0.28196058132165136 >>> 8 0.31473956032575623 0.1958710201604552 #different >>> 9 0.33588303670653136 0.2101641630182972 #different >>> 10 0.4075242509025311 0.2997807499800962 #different >>> 15 0.4391771087975972 0.4391771087975972 >>> 20 0.3837789445609818 0.2713167833345173 #different >>> 25 0.4252154334940311 0.4252154334940311 >>> 30 0.4256407777477492 0.4256407777477492 >>> 35 0.45314944605858387 0.45314944605858387 >>> 40 0.4278633233755064 0.4278633233755064 >>> 45 0.46174762022239796 0.46174762022239796 >>> 50 0.45370452524846866 0.45370452524846866 >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> This might be because current version of libsvm used in scikit is 3.10 >>> from 2011. With some patch imported from upstream. >>> _______________________________________________ scikit-learn mailing >>> list scikit-learn at python.org https://mail.python.org/mailma >>> n/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> I don't think it is so easy, version which is used in scikit-learn has >> many additional modifications. 
>> >> from header of svm.cpp: /* Modified 2010: - Support for dense data >> by Ming-Fang Weng - Return indices for support vectors, Fabian Pedregosa >> - Fixes >> to avoid name collision, Fabian Pedregosa - Add support for instance >> weights, Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien >> Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, >> > data_instances> >> . >> - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ross at cgl.ucsf.edu Sat Aug 27 09:55:10 2016 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sat, 27 Aug 2016 06:55:10 -0700 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: <239c735a-fb69-f2a4-6dc8-235dca562bf2@cgl.ucsf.edu> One logical possibility is if svm would accept the scikit-learn changes. On 8/27/16 6:42 AM, elgesto at gmail.com wrote: > So there is no possibility to reach a consistency? > > 2016-08-27 15:36 GMT+03:00 olologin >: > > On 08/27/2016 02:19 PM, elgesto at gmail.com > wrote: >> Can I update the libsvm version by myself? >> >> 2016-08-27 12:49 GMT+03:00 olologin > >: >> >> On 08/27/2016 12:33 PM, elgesto at gmail.com >> wrote: >>> >>> I have a project that is based on SVM algorithm implemented >>> by libsvm . >>> Recently I decided to try several other classification >>> algorithm, this is where scikit-learn >>> comes to the picture. >>> >>> The connection to the scikit was pretty straightforward, it >>> supports libsvm format by |load_svmlight_file| routine. Ans >>> it's svm implementation is based on the same libsvm. >>> >>> When everything was done, I decided to the check the >>> consistence of the results by directly running libsvm and >>> via scikit-learn, and the results were different. Among 18 >>> measures in learning curves, 7 were different, and the >>> difference is located at the small steps of the learning >>> curve. The libsvm results seems much more stable, but >>> scikit-learn results have some drastic fluctuation. >>> >>> The classifiers have exactly the same parameters of course. >>> I tried to check the version of libsvm in scikit-learn >>> implementation, but I din't find it, the only thing I found >>> was libsvm.so file. >>> >>> Currently I am using libsvm 3.21 version, and scikit-learn >>> 0.17.1 version. >>> >>> I wound appreciate any help in addressing this issue. 
>>> >>> >>> |size libsvm scikit-learn 1 0.1336239435355727 >>> 0.1336239435355727 2 0.08699516468193455 0.08699516468193455 >>> 3 0.32928301642777424 0.2117238289550198 #different 4 >>> 0.2835688734876902 0.2835688734876902 5 0.27846766962743097 >>> 0.26651875338163966 #different 6 0.2853854654662907 >>> 0.18898048915599963 #different 7 0.28196058132165136 >>> 0.28196058132165136 8 0.31473956032575623 0.1958710201604552 >>> #different 9 0.33588303670653136 0.2101641630182972 >>> #different 10 0.4075242509025311 0.2997807499800962 >>> #different 15 0.4391771087975972 0.4391771087975972 20 >>> 0.3837789445609818 0.2713167833345173 #different 25 >>> 0.4252154334940311 0.4252154334940311 30 0.4256407777477492 >>> 0.4256407777477492 35 0.45314944605858387 >>> 0.45314944605858387 40 0.4278633233755064 0.4278633233755064 >>> 45 0.46174762022239796 0.46174762022239796 50 >>> 0.45370452524846866 0.45370452524846866| >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> This might be because current version of libsvm used in >> scikit is 3.10 from 2011. With some patch imported from >> upstream. >> >> _______________________________________________ scikit-learn >> mailing list scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > I don't think it is so easy, version which is used in scikit-learn > has many additional modifications. > > from header of svm.cpp: /* Modified 2010: - Support for > dense data by Ming-Fang Weng - Return indices for support > vectors, Fabian Pedregosa > - Fixes to avoid name > collision, Fabian Pedregosa - Add support for instance weights, > Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien > Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, > > . > - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ > > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Sat Aug 27 12:20:31 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 27 Aug 2016 18:20:31 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: I am not sure this is exactly the same because we do not center the data in the TruncatedSVD case (as opposed to the real PCA case where whitening is the same as calling StandardScaler). Having an option to normalize the transformed data by sigma seems like a good idea but we should probably not call that whitening. 
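In the meantime the normalization can be emulated from the transformed data itself, since the column norms of fit_transform's output are (approximately) the singular values. A rough sketch, again with X_tfidf standing in for the tf-idf matrix:

import numpy as np
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=0)
X_lsa = svd.fit_transform(X_tfidf)       # rows correspond to d.T @ U (scaled by Sigma)
sigma = np.linalg.norm(X_lsa, axis=0)    # approximate singular values, recovered from the column norms
X_lsa_norm = X_lsa / sigma               # rows now correspond to d.T @ U @ Sigma^-1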
-- Olivier From olivier.grisel at ensta.org Sat Aug 27 12:37:37 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 27 Aug 2016 18:37:37 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: BTW Roman, the examples in your gist would make a great non-regression test for this new feature. Please feel free to submit a PR. -- Olivier From mathieu at mblondel.org Sun Aug 28 00:30:05 2016 From: mathieu at mblondel.org (Mathieu Blondel) Date: Sun, 28 Aug 2016 13:30:05 +0900 Subject: [scikit-learn] GradientBoostingRegressor, question about initialisation with MeanEstimator In-Reply-To: References: Message-ID: This comes from Algorithm 1, line 1, in "Greedy Function Approximation: a Gradient Boosting Machine" by J. Friedman. Intuitively, this has the same effect as fitting a bias (intercept) term in a linear model. This allows the subsequent iterations (decision trees) to work with centered targets. Mathieu On Wed, Aug 24, 2016 at 5:24 AM, ??????? ????? wrote: > Hi there, > > I recently found out that GradientBoostingRegressor uses MeanEstimator for > the initial estimator in ensemble. Could you please point out (or > explain) to the research showing superiority of this approach compared to > the usage of DecisionTreeRegressor? > > -- > Yours sincerely, > Alexey A. Dral > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aadral at gmail.com Sun Aug 28 04:57:28 2016 From: aadral at gmail.com (=?UTF-8?B?0JDQu9C10LrRgdC10Lkg0JTRgNCw0LvRjA==?=) Date: Sun, 28 Aug 2016 09:57:28 +0100 Subject: [scikit-learn] GradientBoostingRegressor, question about initialisation with MeanEstimator In-Reply-To: References: Message-ID: Hi Mathieu, I was looking exactly for this article. Thank you very much. 2016-08-28 5:30 GMT+01:00 Mathieu Blondel : > This comes from Algorithm 1, line 1, in "Greedy Function Approximation: a > Gradient Boosting Machine" by J. Friedman. > > Intuitively, this has the same effect as fitting a bias (intercept) term > in a linear model. This allows the subsequent iterations (decision trees) > to work with centered targets. > > Mathieu > > On Wed, Aug 24, 2016 at 5:24 AM, ??????? ????? wrote: > >> Hi there, >> >> I recently found out that GradientBoostingRegressor uses MeanEstimator >> for the initial estimator in ensemble. Could you please point out (or >> explain) to the research showing superiority of this approach compared to >> the usage of DecisionTreeRegressor? >> >> -- >> Yours sincerely, >> Alexey A. Dral >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Yours sincerely, Alexey A. Dral -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From drraph at gmail.com Sun Aug 28 10:35:33 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 16:35:33 +0200 Subject: [scikit-learn] Does NMF optimise over observed values Message-ID: Reading the docs for http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html it says The objective function is: 0.5 * ||X - WH||_Fro^2 + alpha * l1_ratio * ||vec(W)||_1 + alpha * l1_ratio * ||vec(H)||_1 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 Where: ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. What is the true objective function NMF is using? Raphael -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Aug 28 10:57:44 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 16:57:44 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: What I meant was, how is the objective function defined when X is sparse? Raphael On Sunday, August 28, 2016, Raphael C wrote: > Reading the docs for http://scikit-learn.org/stable/modules/generated/ > sklearn.decomposition.NMF.html it says > > The objective function is: > > 0.5 * ||X - WH||_Fro^2 > + alpha * l1_ratio * ||vec(W)||_1 > + alpha * l1_ratio * ||vec(H)||_1 > + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 > + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 > > Where: > > ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) > ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) > > This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. > > > What is the true objective function NMF is using? > > > Raphael > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arthur.mensch at inria.fr Sun Aug 28 11:44:43 2016 From: arthur.mensch at inria.fr (Arthur Mensch) Date: Sun, 28 Aug 2016 17:44:43 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: Zeros are considered as zeros in the objective function, not as missing values - - i.e. no mask in the loss function. Le 28 ao?t 2016 16:58, "Raphael C" a ?crit : What I meant was, how is the objective function defined when X is sparse? Raphael On Sunday, August 28, 2016, Raphael C wrote: > Reading the docs for http://scikit-learn.org/st > able/modules/generated/sklearn.decomposition.NMF.html it says > > The objective function is: > > 0.5 * ||X - WH||_Fro^2 > + alpha * l1_ratio * ||vec(W)||_1 > + alpha * l1_ratio * ||vec(H)||_1 > + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 > + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 > > Where: > > ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) > ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) > > This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. 
The remaining elements should effectively be regarded as missing. > > > What is the true objective function NMF is using? > > > Raphael > > _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Aug 28 12:15:55 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 18:15:55 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: Thank you for the quick reply. Just to make sure I understand, if X is sparse and n by n with X[0,0] = 1, X_[n-1, n-1]=0 explicitly set (that is only two values are set in X) then this is treated the same for the purposes of the objective function as the all zeros n by n matrix with X[0,0] set to 1? That is all elements of X that are not specified explicitly are assumed to be 0? It would be really useful if it were possible to have a version of NMF where contributions to the objective function are only counted where the value is explicitly set in X. This is AFAIK the standard formulation for collaborative filtering. Would there be any interest in doing this? In theory it should be a simple modification of the optimisation code. Raphael On Sunday, August 28, 2016, Arthur Mensch wrote: > Zeros are considered as zeros in the objective function, not as missing > values - - i.e. no mask in the loss function. > Le 28 ao?t 2016 16:58, "Raphael C" > a ?crit : > > What I meant was, how is the objective function defined when X is sparse? > > Raphael > > > On Sunday, August 28, 2016, Raphael C > wrote: > >> Reading the docs for http://scikit-learn.org/st >> able/modules/generated/sklearn.decomposition.NMF.html it says >> >> The objective function is: >> >> 0.5 * ||X - WH||_Fro^2 >> + alpha * l1_ratio * ||vec(W)||_1 >> + alpha * l1_ratio * ||vec(H)||_1 >> + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 >> + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 >> >> Where: >> >> ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) >> ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) >> >> This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. >> >> >> What is the true objective function NMF is using? >> >> >> Raphael >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Aug 28 12:20:03 2016 From: t3kcit at gmail.com (Andy) Date: Sun, 28 Aug 2016 12:20:03 -0400 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: <350ce275-2c9e-4d44-7155-edd4293acb2c@gmail.com> On 08/27/2016 09:48 AM, Joel Nothman wrote: > I don't think we should assume that this is the only possible reason > for inconsistency. Could you give us a small snippet of data and code > on which you find this inconsistency? > I would also expect different settings or random states or data preparation to be more likely culprits. 
From t3kcit at gmail.com Sun Aug 28 12:20:45 2016 From: t3kcit at gmail.com (Andy) Date: Sun, 28 Aug 2016 12:20:45 -0400 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: If you do "with_mean=False" it should be the same, right? On 08/27/2016 12:20 PM, Olivier Grisel wrote: > I am not sure this is exactly the same because we do not center the > data in the TruncatedSVD case (as opposed to the real PCA case where > whitening is the same as calling StandardScaler). > > Having an option to normalize the transformed data by sigma seems like > a good idea but we should probably not call that whitening. > From michael at bommaritollc.com Sun Aug 28 12:22:44 2016 From: michael at bommaritollc.com (Michael Bommarito) Date: Sun, 28 Aug 2016 12:22:44 -0400 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: <350ce275-2c9e-4d44-7155-edd4293acb2c@gmail.com> References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> <350ce275-2c9e-4d44-7155-edd4293acb2c@gmail.com> Message-ID: Any chance it's related to the seed issue in the "Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same" thread? Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Sun, Aug 28, 2016 at 12:20 PM, Andy wrote: > > > On 08/27/2016 09:48 AM, Joel Nothman wrote: > >> I don't think we should assume that this is the only possible reason for >> inconsistency. Could you give us a small snippet of data and code on which >> you find this inconsistency? >> >> I would also expect different settings or random states or data > preparation to be more likely culprits. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Aug 28 12:29:59 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 18:29:59 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: To give a little context from the web, see e.g. http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ where it explains: " A question might have come to your mind by now: if we find two matrices [image: \mathbf{P}] and [image: \mathbf{Q}] such that [image: \mathbf{P} \times \mathbf{Q}] approximates [image: \mathbf{R}], isn?t that our predictions of all the unseen ratings will all be zeros? In fact, we are not really trying to come up with [image: \mathbf{P}] and [image: \mathbf{Q}] such that we can reproduce [image: \mathbf{R}] exactly. Instead, we will only try to minimise the errors of the observed user-item pairs. " Raphael On Sunday, August 28, 2016, Raphael C wrote: > Thank you for the quick reply. Just to make sure I understand, if X is > sparse and n by n with X[0,0] = 1, X_[n-1, n-1]=0 explicitly set (that is > only two values are set in X) then this is treated the same for the > purposes of the objective function as the all zeros n by n matrix with > X[0,0] set to 1? That is all elements of X that are not specified > explicitly are assumed to be 0? 
> > It would be really useful if it were possible to have a version of NMF > where contributions to the objective function are only counted where the > value is explicitly set in X. This is AFAIK the standard formulation for > collaborative filtering. Would there be any interest in doing this? In > theory it should be a simple modification of the optimisation code. > > Raphael > > > > On Sunday, August 28, 2016, Arthur Mensch > wrote: > >> Zeros are considered as zeros in the objective function, not as missing >> values - - i.e. no mask in the loss function. >> Le 28 ao?t 2016 16:58, "Raphael C" a ?crit : >> >> What I meant was, how is the objective function defined when X is sparse? >> >> Raphael >> >> >> On Sunday, August 28, 2016, Raphael C wrote: >> >>> Reading the docs for http://scikit-learn.org/st >>> able/modules/generated/sklearn.decomposition.NMF.html it says >>> >>> The objective function is: >>> >>> 0.5 * ||X - WH||_Fro^2 >>> + alpha * l1_ratio * ||vec(W)||_1 >>> + alpha * l1_ratio * ||vec(H)||_1 >>> + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 >>> + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 >>> >>> Where: >>> >>> ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) >>> ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) >>> >>> This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. >>> >>> >>> What is the true objective function NMF is using? >>> >>> >>> Raphael >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Aug 28 12:37:05 2016 From: t3kcit at gmail.com (Andy) Date: Sun, 28 Aug 2016 12:37:05 -0400 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> On 08/28/2016 12:29 PM, Raphael C wrote: > To give a little context from the web, see e.g. > http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ where > it explains: > > " > A question might have come to your mind by now: if we find two > matrices \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times > \mathbf{Q} approximates \mathbf{R}, isn?t that our predictions of all > the unseen ratings will all be zeros? In fact, we are not really > trying to come up with \mathbf{P} and \mathbf{Q} such that we can > reproduce \mathbf{R} exactly. Instead, we will only try to minimise > the errors of the observed user-item pairs. > " Yes, the sklearn interface is not meant for matrix completion but matrix-factorization. There was a PR for some matrix completion for missing value imputation at some point. In general, scikit-learn doesn't really implement anything for recommendation algorithms as that requires a different interface. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From drraph at gmail.com Sun Aug 28 13:16:14 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 19:16:14 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> Message-ID: On Sunday, August 28, 2016, Andy wrote: > > > On 08/28/2016 12:29 PM, Raphael C wrote: > > To give a little context from the web, see e.g. http://www.quuxlabs.com/ > blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation- > in-python/ where it explains: > > " > A question might have come to your mind by now: if we find two matrices [image: > \mathbf{P}] and [image: \mathbf{Q}] such that [image: \mathbf{P} \times > \mathbf{Q}] approximates [image: \mathbf{R}], isn?t that our predictions > of all the unseen ratings will all be zeros? In fact, we are not really > trying to come up with [image: \mathbf{P}] and [image: \mathbf{Q}] such > that we can reproduce [image: \mathbf{R}] exactly. Instead, we will only > try to minimise the errors of the observed user-item pairs. > " > > Yes, the sklearn interface is not meant for matrix completion but > matrix-factorization. > There was a PR for some matrix completion for missing value imputation at > some point. > > In general, scikit-learn doesn't really implement anything for > recommendation algorithms as that requires a different interface. > Thanks Andy. I just looked up that PR. I was thinking simply producing a different factorisation optimised only over the observed values wouldn't need a new interface. That in itself would be hugely useful. I can see that providing a full drop in recommender system would involve more work. Raphael -------------- next part -------------- An HTML attachment was scrubbed... URL: From cs14btech11041 at iith.ac.in Sun Aug 28 15:09:58 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Mon, 29 Aug 2016 00:39:58 +0530 Subject: [scikit-learn] Issue with DecisionTreeClassifier Message-ID: Dear Developers, DecisionTreeClassifier.decision_path() as used here http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html is giving the following error: AttributeError: 'DecisionTreeClassifier' object has no attribute 'decision_path' Kindly help. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Sun Aug 28 15:23:59 2016 From: nfliu at uw.edu (Nelson Liu) Date: Sun, 28 Aug 2016 19:23:59 +0000 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: That should be: node indicator = estimator.tree_.decision_path(X_test) PR welcome :) On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Dear Developers, > > DecisionTreeClassifier.decision_path() as used here > http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html > is giving the following error: > > AttributeError: 'DecisionTreeClassifier' object has no attribute > 'decision_path' > > Kindly help. > > Thanks > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
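A slightly fuller sketch of that workaround on the iris data (untested; it assumes a recent scikit-learn where Tree.decision_path exists, and note that the low-level Tree methods expect float32 input):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
estimator = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

X_test = np.asarray(iris.data, dtype=np.float32)           # Tree methods want float32 arrays
node_indicator = estimator.tree_.decision_path(X_test)     # CSR matrix, one row per sample
print(node_indicator[0].indices)                           # ids of the nodes the first sample passes through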
URL: From nfliu at uw.edu Sun Aug 28 15:25:33 2016 From: nfliu at uw.edu (Nelson Liu) Date: Sun, 28 Aug 2016 19:25:33 +0000 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Oops, phone removed the underscore between the two words of the variable name but I think you get the point. Nelson On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Dear Developers, > > DecisionTreeClassifier.decision_path() as used here > http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html > is giving the following error: > > AttributeError: 'DecisionTreeClassifier' object has no attribute > 'decision_path' > > Kindly help. > > Thanks > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Mon Aug 29 06:39:46 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Mon, 29 Aug 2016 12:39:46 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: <57C410F2.8090905@gmail.com> Thank you for all your responses! In the LSA what is equivalent, I think, is - to apply a L2 normalization (not the StandardScaler) after the LSA and then compute the cosine similarity between document vectors simply as a dot product. - not apply the L2 normalization and call the `cosine_similarity` function instead. I have applied this normalization to the previous example, and it produces indeed equivalent results (i.e. does not solve the problem). Opening an issue on this for further discussion https://github.com/scikit-learn/scikit-learn/issues/7283 Thanks for your feedback! -- Roman On 28/08/16 18:20, Andy wrote: > If you do "with_mean=False" it should be the same, right? > > On 08/27/2016 12:20 PM, Olivier Grisel wrote: >> I am not sure this is exactly the same because we do not center the >> data in the TruncatedSVD case (as opposed to the real PCA case where >> whitening is the same as calling StandardScaler). >> >> Having an option to normalize the transformed data by sigma seems like >> a good idea but we should probably not call that whitening. >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From cs14btech11041 at iith.ac.in Mon Aug 29 11:15:52 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Mon, 29 Aug 2016 20:45:52 +0530 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Hi, Is there a way to extract impurity value of a node in DecisionTreeClassifier? I am able to get this value in graph (using export_grapgviz), but can't figure out how to get this value in my code. Is there any attribute similar to estimator.tree_.children_left? 
Thanks On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu wrote: > That should be: > node indicator = estimator.tree_.decision_path(X_test) > > PR welcome :) > > On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > >> Dear Developers, >> >> DecisionTreeClassifier.decision_path() as used here >> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html >> is giving the following error: >> >> AttributeError: 'DecisionTreeClassifier' object has no attribute >> 'decision_path' >> >> Kindly help. >> >> Thanks >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Mon Aug 29 11:23:30 2016 From: nfliu at uw.edu (Nelson Liu) Date: Mon, 29 Aug 2016 15:23:30 +0000 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Hi, Yes, it's estimator.tree_.impurity Nelson On Mon, Aug 29, 2016, 09:18 Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Hi, > > Is there a way to extract impurity value of a node in > DecisionTreeClassifier? I am able to get this value in graph (using > export_grapgviz), but can't figure out how to get this value in my code. Is > there any attribute similar to estimator.tree_.children_left? > > Thanks > > On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu wrote: > >> That should be: >> node indicator = estimator.tree_.decision_path(X_test) >> >> PR welcome :) >> >> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < >> scikit-learn at python.org> wrote: >> >>> Dear Developers, >>> >>> DecisionTreeClassifier.decision_path() as used here >>> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html >>> is giving the following error: >>> >>> AttributeError: 'DecisionTreeClassifier' object has no attribute >>> 'decision_path' >>> >>> Kindly help. >>> >>> Thanks >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cs14btech11041 at iith.ac.in Mon Aug 29 11:44:21 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Mon, 29 Aug 2016 21:14:21 +0530 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Thanks Nelson. Is there any way to access number of training samples in a node? Thanks On Mon, Aug 29, 2016 at 8:53 PM, Nelson Liu wrote: > Hi, > Yes, it's estimator.tree_.impurity > > Nelson > > On Mon, Aug 29, 2016, 09:18 Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > >> Hi, >> >> Is there a way to extract impurity value of a node in >> DecisionTreeClassifier? 
I am able to get this value in graph (using >> export_grapgviz), but can't figure out how to get this value in my code. Is >> there any attribute similar to estimator.tree_.children_left? >> >> Thanks >> >> On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu wrote: >> >>> That should be: >>> node indicator = estimator.tree_.decision_path(X_test) >>> >>> PR welcome :) >>> >>> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < >>> scikit-learn at python.org> wrote: >>> >>>> Dear Developers, >>>> >>>> DecisionTreeClassifier.decision_path() as used here >>>> http://scikit-learn.org/dev/auto_examples/tree/unveil_ >>>> tree_structure.html is giving the following error: >>>> >>>> AttributeError: 'DecisionTreeClassifier' object has no attribute >>>> 'decision_path' >>>> >>>> Kindly help. >>>> >>>> Thanks >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 29 11:50:24 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 29 Aug 2016 11:50:24 -0400 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> Message-ID: On 08/28/2016 01:16 PM, Raphael C wrote: > > > On Sunday, August 28, 2016, Andy > wrote: > > > > On 08/28/2016 12:29 PM, Raphael C wrote: >> To give a little context from the web, see e.g. >> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ >> where >> it explains: >> >> " >> A question might have come to your mind by now: if we find two >> matrices \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times >> \mathbf{Q} approximates \mathbf{R}, isn?t that our predictions of >> all the unseen ratings will all be zeros? In fact, we are not >> really trying to come up with \mathbf{P} and \mathbf{Q} such that >> we can reproduce \mathbf{R} exactly. Instead, we will only try to >> minimise the errors of the observed user-item pairs. >> " > Yes, the sklearn interface is not meant for matrix completion but > matrix-factorization. > There was a PR for some matrix completion for missing value > imputation at some point. > > In general, scikit-learn doesn't really implement anything for > recommendation algorithms as that requires a different interface. > > > Thanks Andy. I just looked up that PR. > > I was thinking simply producing a different factorisation optimised > only over the observed values wouldn't need a new interface. That in > itself would be hugely useful. Depends. Usually you don't want to complete all values, but only compute a factorization. What do you return? Only the factors? The PR implements completing everything, and that you can do with the transformer interface. I'm not sure what the status of the PR is, but doing that with NMF instead of SVD would certainly also be interesting. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 29 11:52:16 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 29 Aug 2016 11:52:16 -0400 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com> On 08/28/2016 03:23 PM, Nelson Liu wrote: > That should be: > node indicator = estimator.tree_.decision_path(X_test) > > PR welcome :) Was there a reason not to make this a "plot" example? Would it take too long? Not having run examples by CI is a pretty big maintenance burden. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.duprelatour at orange.fr Mon Aug 29 13:01:57 2016 From: tom.duprelatour at orange.fr (Tom DLT) Date: Mon, 29 Aug 2016 19:01:57 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> Message-ID: If X is sparse, explicit zeros and missing-value zeros are **both** considered as zeros in the objective functions. Changing the objective function wouldn't need a new interface, yet I am not sure the code change would be completely trivial. The question is: do we want this new objective function in scikit-learn, since we have no other recommendation-like algorithm? If we agree that it would useful, feel free to send a PR. Tom 2016-08-29 17:50 GMT+02:00 Andreas Mueller : > > > On 08/28/2016 01:16 PM, Raphael C wrote: > > > > On Sunday, August 28, 2016, Andy wrote: > >> >> >> On 08/28/2016 12:29 PM, Raphael C wrote: >> >> To give a little context from the web, see e.g. http://www.quuxlabs.com/b >> log/2010/09/matrix-factorization-a-simple-tutorial-and- >> implementation-in-python/ where it explains: >> >> " >> A question might have come to your mind by now: if we find two matrices [image: >> \mathbf{P}] and [image: \mathbf{Q}] such that [image: \mathbf{P} \times >> \mathbf{Q}] approximates [image: \mathbf{R}], isn?t that our predictions >> of all the unseen ratings will all be zeros? In fact, we are not really >> trying to come up with [image: \mathbf{P}] and [image: \mathbf{Q}] such >> that we can reproduce [image: \mathbf{R}] exactly. Instead, we will only >> try to minimise the errors of the observed user-item pairs. >> " >> >> Yes, the sklearn interface is not meant for matrix completion but >> matrix-factorization. >> There was a PR for some matrix completion for missing value imputation at >> some point. >> >> In general, scikit-learn doesn't really implement anything for >> recommendation algorithms as that requires a different interface. >> > > Thanks Andy. I just looked up that PR. > > I was thinking simply producing a different factorisation optimised only > over the observed values wouldn't need a new interface. That in itself > would be hugely useful. > > Depends. Usually you don't want to complete all values, but only compute a > factorization. What do you return? Only the factors? > The PR implements completing everything, and that you can do with the > transformer interface. I'm not sure what the status of the PR is, > but doing that with NMF instead of SVD would certainly also be interesting. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
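For anyone who wants to experiment with that masked objective before such a PR exists, here is a self-contained toy sketch in plain NumPy (multiplicative updates restricted to the observed entries; this is not scikit-learn's NMF and carries no regularization):

import numpy as np

def masked_nmf(X, mask, n_components=2, n_iter=200, eps=1e-9, seed=0):
    # minimise ||mask * (X - W @ H)||_Fro^2 over non-negative W, H
    rng = np.random.RandomState(seed)
    W = rng.rand(X.shape[0], n_components)
    H = rng.rand(n_components, X.shape[1])
    MX = mask * X
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

X = np.array([[5., 3., 0.],
              [4., 0., 1.]])
mask = np.array([[1., 1., 0.],     # 1 = observed, 0 = missing
                 [1., 0., 1.]])
W, H = masked_nmf(X, mask)
print(np.round(W @ H, 2))          # only the observed positions are fitted; the rest are predictions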
URL:

From drraph at gmail.com Mon Aug 29 14:11:42 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 29 Aug 2016 20:11:42 +0200
Subject: [scikit-learn] Does NMF optimise over observed values
In-Reply-To:
References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com>
Message-ID:

On Monday, August 29, 2016, Andreas Mueller wrote: > > > On 08/28/2016 01:16 PM, Raphael C wrote: > > > > On Sunday, August 28, 2016, Andy > wrote: > >> >> >> On 08/28/2016 12:29 PM, Raphael C wrote: >> >> To give a little context from the web, see e.g. http://www.quuxlabs.com/b >> log/2010/09/matrix-factorization-a-simple-tutorial-and- >> implementation-in-python/ where it explains: >> >> " >> A question might have come to your mind by now: if we find two matrices >> \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times >> \mathbf{Q} approximates \mathbf{R}, isn't that our predictions >> of all the unseen ratings will all be zeros? In fact, we are not really >> trying to come up with \mathbf{P} and \mathbf{Q} such >> that we can reproduce \mathbf{R} exactly. Instead, we will only >> try to minimise the errors of the observed user-item pairs. >> " >> >> Yes, the sklearn interface is not meant for matrix completion but >> matrix-factorization. >> There was a PR for some matrix completion for missing value imputation at >> some point. >> >> In general, scikit-learn doesn't really implement anything for >> recommendation algorithms as that requires a different interface. >> > > Thanks Andy. I just looked up that PR. > > I was thinking simply producing a different factorisation optimised only > over the observed values wouldn't need a new interface. That in itself > would be hugely useful. > > Depends. Usually you don't want to complete all values, but only compute a > factorization. What do you return? Only the factors? > > The PR implements completing everything, and that you can do with the > transformer interface. I'm not sure what the status of the PR is, > but doing that with NMF instead of SVD would certainly also be interesting. >

I was thinking you would literally return W and H so that WH approx X. The user can then decide what to do with the factorisation just like when doing SVD.

Raphael

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cs14btech11041 at iith.ac.in Mon Aug 29 23:03:59 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 30 Aug 2016 08:33:59 +0530
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

Hi,

What does the estimator.tree_.value array represent? I looked up the source code but not able to get what it is. I am interested in the number of training samples of each class in a given tree node.

Thanks

On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller wrote: > > > On 08/28/2016 03:23 PM, Nelson Liu wrote: > > That should be: > node indicator = estimator.tree_.decision_path(X_test) > > PR welcome :) > > Was there a reason not to make this a "plot" example? > Would it take too long? Not having run examples by CI is a pretty big > maintenance burden. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From nfliu at uw.edu Mon Aug 29 23:22:41 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Tue, 30 Aug 2016 03:22:41 +0000
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To:
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

estimator.tree_.value gives the constant prediction of the tree at each node. Think of it as what the tree would output if that node was a leaf.

I don't think we have a readily available way of checking the number of training samples of each class in a given tree node. The closest thing easily accessible is estimator.tree_.n_node_samples. Getting finer-grained counts of the number of samples in each class would require modifying the source code, I think.

On Mon, Aug 29, 2016 at 8:06 PM Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Hi, > > What does the estimator.tree_.value array represent? I looked up the > source code but not able to get what it is. I am interested in the number > of training samples of each class in a given tree node. > > Thanks > > On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller wrote: > >> >> >> On 08/28/2016 03:23 PM, Nelson Liu wrote: >> >> That should be: >> node indicator = estimator.tree_.decision_path(X_test) >> >> PR welcome :) >> >> Was there a reason not to make this a "plot" example? >> Would it take too long? Not having run examples by CI is a pretty big >> maintenance burden. >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com Mon Aug 29 23:31:19 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 30 Aug 2016 13:31:19 +1000
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To:
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

Or just running estimator.tree_.apply(X_train) and inferring from there.

On 30 August 2016 at 13:22, Nelson Liu wrote: > estimator.tree_.value gives the constant prediction of the tree at each > node. Think of it as what the tree would output if that node was a leaf. > > I don't think we have a readily available way of checking the number of > training samples of each class in a given tree node. The closest thing > easily accessible is estimator.tree_.n_node_samples. Getting > finer-grained counts of the number of samples in each class would require > modifying the source code, I think. > > On Mon, Aug 29, 2016 at 8:06 PM Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > >> Hi, >> >> What does the estimator.tree_.value array represent? I looked up the >> source code but not able to get what it is. I am interested in the number >> of training samples of each class in a given tree node. >> >> Thanks >> >> On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller >> wrote: >> >>> >>> >>> On 08/28/2016 03:23 PM, Nelson Liu wrote: >>> >>> That should be: >>> node indicator = estimator.tree_.decision_path(X_test) >>> >>> PR welcome :) >>> >>> Was there a reason not to make this a "plot" example? >>> Would it take too long? Not having run examples by CI is a pretty big >>> maintenance burden.
>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Tue Aug 30 09:56:22 2016
From: t3kcit at gmail.com (Andy)
Date: Tue, 30 Aug 2016 09:56:22 -0400
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To:
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

On 08/29/2016 11:22 PM, Nelson Liu wrote: > estimator.tree_.value gives the constant prediction of the tree at > each node. Think of it as what the tree would output if that node was > a leaf.

well it's also the weighted number of samples of each class, right?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Tue Aug 30 10:31:51 2016
From: t3kcit at gmail.com (Andy)
Date: Tue, 30 Aug 2016 10:31:51 -0400
Subject: [scikit-learn] [Scikit-learn-general] Dropping Python 2.6 compatibility
In-Reply-To: <20160104132846.GA1242267@phare.normalesup.org>
References: <20160104132846.GA1242267@phare.normalesup.org>
Message-ID:

Hi all.
Picking up this old thread, I propose we announce that 0.18 is the last release that will support Python 2.6. That will give people some time to think about it between releases.
Wdyt?

Andy

On 01/04/2016 08:28 AM, Gael Varoquaux wrote: > Happy new year everybody, > > As a new year resolution, I suggest that we drop Python 2.6 > compatibility. > > For an argumentation in this favor, see > http://www.snarky.ca/stop-using-python-2-6 (I don't buy everything there, > but the core idea is there). > > For us, this will mean more usage of context managers, which is good. > > The down side is that many clusters run RedHat variant that are still > under 2.6 (Duh!). The question is: are people using the stock Python on > the clusters, or something else. > > Opinions please? > > Gaël > > ------------------------------------------------------------------------------ > _______________________________________________ > Scikit-learn-general mailing list > Scikit-learn-general at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

From gael.varoquaux at normalesup.org Tue Aug 30 10:37:47 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 30 Aug 2016 16:37:47 +0200
Subject: [scikit-learn] [Scikit-learn-general] Dropping Python 2.6 compatibility
In-Reply-To:
References: <20160104132846.GA1242267@phare.normalesup.org>
Message-ID: <20160830143747.GH1642932@phare.normalesup.org>

> Picking up this old thread, I propose we announce that 0.18 is the last > release that > will support Python 2.6. > That will give people some time to think about it between releases.

As you know: +1 from my side.

One of my arguments for this is that it becomes harder and harder to set up continuous integration environments to test related projects with 2.6. Hence related projects are likely to coderot under 2.6.

Thanks for raising this issue.
Gaël

From ilya.persky at gmail.com Tue Aug 30 15:19:40 2016
From: ilya.persky at gmail.com (Ilya Persky)
Date: Tue, 30 Aug 2016 22:19:40 +0300
Subject: [scikit-learn] How to deal with minor inconsistencies in scikit source code?
Message-ID:

Hi All!

I'm now reading scikit-learn source code and sometimes meet minor inconsistencies here and there like unnecessary copying of some array or some very unimportant race condition. Nothing like serious bug really.

What should I do about it? Creating an issue for each case would be overkill. Create an issue for all of them and add pull request with fixes? Or first send a letter with them here?..

Again I'm new to this code and can be easily missing something (something looking like minor bug could appear to be a feature :) ).

--
Thank you,
Ilya.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Tue Aug 30 15:24:48 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 30 Aug 2016 15:24:48 -0400
Subject: [scikit-learn] How to deal with minor inconsistencies in scikit source code?
In-Reply-To:
References:
Message-ID: <65c6d2b1-35ff-2769-d627-7e61f2392df1@gmail.com>

Hi Ilya.
You can raise an issue with multiple minor problems, or you can just send a PR. We don't really like to do many cosmetic fixes, because they tend to create merge conflicts. But for semantic changes, like avoiding an array copy, we're very happy about any improvements. You can totally pack multiple one-line changes into a single PR if they are all simple to review.
One thing to keep in mind: the shorter the PR, the faster the review and merge ;)

Andy

On 08/30/2016 03:19 PM, Ilya Persky wrote: > Hi All! > > I'm now reading scikit-learn source code and sometimes meet minor > inconsistencies here and there like unnecessary copying of some array > or some very unimportant race condition. Nothing like serious bug really. > > What should I do about it? Creating an issue for each case would be > overkill. Create an issue for all of them and add pull request with > fixes? Or first send a letter with them here?.. > > Again I'm new to this code and can be easily missing something > (something looking like minor bug could appear to be a feature :) ). > > -- > Thank you, > Ilya. > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From douglas.chan at ieee.org Wed Aug 31 01:26:11 2016
From: douglas.chan at ieee.org (Douglas Chan)
Date: Tue, 30 Aug 2016 22:26:11 -0700
Subject: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
Message-ID:

Hello everyone,

I notice conditions when Feature Importance values do not add up to 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees. I wonder if there's a bug in the code.

This error occurs when the ensemble has a large number of estimators. The exact conditions depend variously. For example, the error shows up sooner with a smaller amount of training samples. Or, if the depth of the tree is large.

When this error appears, the predicted value seems to have converged. But it's unclear if the error is causing the predicted value not to change with more estimators. In fact, the feature importance sum goes lower and lower with more estimators thereafter.

I wonder if we're hitting some floating point calculation error.

Looking forward to hear your thoughts on this.
Thank you!
-Doug

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From drraph at gmail.com Wed Aug 31 02:28:29 2016
From: drraph at gmail.com (Raphael C)
Date: Wed, 31 Aug 2016 08:28:29 +0200
Subject: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
In-Reply-To:
References:
Message-ID:

Can you provide a reproducible example?
Raphael

On Wednesday, August 31, 2016, Douglas Chan wrote: > Hello everyone, > > I notice conditions when Feature Importance values do not add up to 1 in > ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees. I > wonder if there's a bug in the code. > > This error occurs when the ensemble has a large number of estimators. The > exact conditions depend variously. For example, the error shows up sooner > with a smaller amount of training samples. Or, if the depth of the tree is > large. > > When this error appears, the predicted value seems to have converged. But > it's unclear if the error is causing the predicted value not to change with > more estimators. In fact, the feature importance sum goes lower and lower > with more estimators thereafter. > > I wonder if we're hitting some floating point calculation error. > > Looking forward to hear your thoughts on this. > > Thank you! > -Doug > >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Myles.Gartland at Rockhurst.edu Tue Aug 30 22:20:38 2016
From: Myles.Gartland at Rockhurst.edu (Gartland, Myles)
Date: Tue, 30 Aug 2016 21:20:38 -0500
Subject: [scikit-learn] MLPClassifier release
Message-ID:

Curious when .18 will be released. I am specifically interested in the MLPClassifier to show my students in class. Not sure I want them on the dev fork just yet.

From t3kcit at gmail.com Wed Aug 31 13:31:43 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 31 Aug 2016 13:31:43 -0400
Subject: [scikit-learn] MLPClassifier release
In-Reply-To:
References:
Message-ID:

On 08/30/2016 10:20 PM, Gartland, Myles wrote: > Curious when .18 will be released. I am specifically interested in the MLPClassifier to show my students in class. Not sure I want them on the dev fork just yet.

Release candidate probably next week.

Andy

From t3kcit at gmail.com Wed Aug 31 17:15:01 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 31 Aug 2016 17:15:01 -0400
Subject: [scikit-learn] Declaring numpy and scipy dependencies?
In-Reply-To:
References: <195faf56-d8c6-49e0-7fd7-5bb4f1b22931@gmail.com> <98971054-939E-416C-BA47-AE5AD515E170@sebastianraschka.com> <705a27d4-3643-bc9b-11a8-80ba0f6752bf@gmail.com>
Message-ID: <36f5d0ef-397d-f5bc-c312-19793482fb06@gmail.com>

On 07/28/2016 03:16 PM, Matthew Brett wrote: > On Thu, Jul 28, 2016 at 8:10 PM, Andreas Mueller wrote: >> >> On 07/28/2016 03:04 PM, Matthew Brett wrote: >>> On Thu, Jul 28, 2016 at 7:55 PM, Sebastian Raschka >>> wrote: >>>> I think that should work fine for the `pip install scikit-learn`, >>>> however, I think the problem was with upgrading, right? >>>> E.g., if you run >>>> >>>> pip install scikit-learn --upgrade >>>> >>>> it would try to upgrade numpy and scipy as well, which may not be >>>> desired. I think the only workaround would be to run >>>> >>>> pip install scikit-learn --upgrade --no-deps >>>> >>>> unless they changed the behavior recently. I mean, it's not really a >>>> problem, but many users may not know about the --no-deps flag.
>>>> >>> Also - the install will work fine for platforms with wheels, but is >>> still bad for platforms without - like the Raspberry Pi. >> Hm... so these would be ARM wheels? Or Raspberry Pi specific ones? > No, they'd have to be Raspberry Pi specific ones because no-one has > worked out a general ARM-wide specification, as we have for Intel > Linux = manylinux1. >

Following up on this thread, I'm trying to write better installation instructions. https://github.com/scikit-learn/scikit-learn/pull/7313

What's the best-practice for cases when there are no wheels? I imagine there's also no conda channel for Raspberry Pi. So is it the package manager?

Andy

From douglas.chan at ieee.org Wed Aug 31 19:52:17 2016
From: douglas.chan at ieee.org (Douglas Chan)
Date: Wed, 31 Aug 2016 16:52:17 -0700
Subject: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
In-Reply-To:
References:
Message-ID:

Thanks for your reply, Raphael. Here's some code using the Boston dataset to reproduce this.

=== START CODE ===
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 712  # Note: From 712 onwards, the feature importance sum is less than 1

params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)

feature_importance_sum = np.sum(clf.feature_importances_)
print "At n_estimators = %i, feature importance sum = %f" % (n_estimators, feature_importance_sum)
=== END CODE ===

If we deem this to be an error, I can open a bug to track it. Please share your thoughts on it.

Thank you,
-Doug

From: Raphael C
Sent: Tuesday, August 30, 2016 11:28 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1

Can you provide a reproducible example?
Raphael

On Wednesday, August 31, 2016, Douglas Chan wrote:

Hello everyone,

I notice conditions when Feature Importance values do not add up to 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees. I wonder if there's a bug in the code.

This error occurs when the ensemble has a large number of estimators. The exact conditions depend variously. For example, the error shows up sooner with a smaller amount of training samples. Or, if the depth of the tree is large.

When this error appears, the predicted value seems to have converged. But it's unclear if the error is causing the predicted value not to change with more estimators. In fact, the feature importance sum goes lower and lower with more estimators thereafter.

I wonder if we're hitting some floating point calculation error.

Looking forward to hear your thoughts on this.

Thank you!
-Doug

--------------------------------------------------------------------------------

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
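Not a confirmed diagnosis, but one cheap thing to check against the snippet above (this is only an illustration and reuses `clf` from that code): whether some of the later trees in the ensemble stop splitting altogether. A tree with no splits has an all-zero normalised importance vector, and if the ensemble value is an average over the per-tree importances, every such tree pulls the overall sum below 1.

=== START CODE ===
import numpy as np

# clf is the fitted GradientBoostingRegressor from the example above.
# estimators_ is a 2-D array of fitted regression trees (one column for regression).
per_tree_sums = np.array([tree.feature_importances_.sum()
                          for stage in clf.estimators_
                          for tree in stage])
print("trees with an all-zero importance vector: %d" % np.sum(per_tree_sums == 0))
print("ensemble feature importance sum: %f" % clf.feature_importances_.sum())

# If a vector summing to 1 is needed downstream, renormalising is a workaround:
importances = clf.feature_importances_ / clf.feature_importances_.sum()
=== END CODE ===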