From yrohinkumar at gmail.com Tue Aug 1 09:15:56 2017 From: yrohinkumar at gmail.com (Rohin Kumar) Date: Tue, 1 Aug 2017 18:45:56 +0530 Subject: [scikit-learn] Nearest neighbor search with 2 distance measures In-Reply-To: References: <379121501436421@mxfront4j.mail.yandex.net> Message-ID: Since you seem to be from Astrophysics/Cosmology background (I am assuming you are jakevdp - the creator of astroML - if you are - I am lucky!), I can explain my application scenario. I am trying to calculate the anisotropic two-point correlation function something like done in rp_pi_tpcf or s_mu_tpcf using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation In halotools ( http://halotools.readthedocs.io/en/latest/function_usage/mock_observables_functions.html) it is implemented using rectangular grids. I could calculate 2pcf with custom metrics using one variable with BallTree as done in astroML. I intend to find the anisotropic counter part. Thanks & Regards, Rohin Y.Rohin Kumar, +919818092877. On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar wrote: > Dear Jake, > > Thanks for your response. I meant to group/count pairs in boxes (using two > arrays simultaneously-hence needing 2 metrics) instead of one distance > array as the binning parameter. I don't know if the algorithm supports such > a thing. For now, I am proceeding with your suggestion of two ball trees at > huge computational cost. I hope I am able to frame my question properly. > > Thanks & Regards, > Rohin. > > > > On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas < > jakevdp at cs.washington.edu> wrote: > >> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar >> wrote: >> >>> *update* >>> >>> May be it doesn't have to be done at the tree creation level. It could >>> be using loops and creating two different balltrees. Something like >>> >>> tree1=BallTree(X,metric='metric1') #for x-z plane >>> tree2=BallTree(X,metric='metric2') #for y-z plane >>> >>> And then calculate correlation functions in a loop to get tpcf(X,r1,r2) >>> using tree1.two_point_correlation(X,r1) and >>> tree2.two_point_correlation(X,r2) >>> >> >> Hi Rohin, >> It's not exactly clear to me what you wish the tree to do with the two >> different metrics, but in any case the ball tree only supports one metric >> at a time. If you can construct your desired result from two ball trees >> each with its own metric, then that's probably the best way to proceed, >> Jake >> >> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yrohinkumar at gmail.com Tue Aug 1 07:48:23 2017 From: yrohinkumar at gmail.com (Rohin Kumar) Date: Tue, 1 Aug 2017 17:18:23 +0530 Subject: [scikit-learn] Nearest neighbor search with 2 distance measures In-Reply-To: References: <379121501436421@mxfront4j.mail.yandex.net> Message-ID: Dear Jake, Thanks for your response. I meant to group/count pairs in boxes (using two arrays simultaneously-hence needing 2 metrics) instead of one distance array as the binning parameter. I don't know if the algorithm supports such a thing. For now, I am proceeding with your suggestion of two ball trees at huge computational cost. I hope I am able to frame my question properly. 
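In rough terms, the "two ball trees, one metric each" workaround described above looks like the minimal sketch below. The metrics and radii are placeholders (a transverse and a line-of-sight separation on random points), not the actual catalogue or metrics from this thread, and the two sets of counts remain independent one-dimensional counts rather than the joint (r1, r2) pair counts ultimately wanted for an anisotropic 2PCF.

    import numpy as np
    from sklearn.neighbors import BallTree

    rng = np.random.RandomState(0)
    X = rng.rand(200, 3)   # toy 3-D point set standing in for the catalogue

    # Placeholder metrics: separation in the x-y plane and along the z axis.
    def r_perp(a, b):
        return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

    def r_par(a, b):
        return abs(a[2] - b[2])

    # One BallTree per metric; user-defined (callable) metrics are supported
    # by BallTree but are much slower than the built-in ones.
    tree_perp = BallTree(X, metric=r_perp)
    tree_par = BallTree(X, metric=r_par)

    bins = np.linspace(0.05, 0.5, 10)
    counts_perp = tree_perp.two_point_correlation(X, bins)  # cumulative pair counts
    counts_par = tree_par.two_point_correlation(X, bins)

    # Note: counts_perp and counts_par are two independent 1-D counts; they
    # do not combine into joint (r_perp, r_par) counts by themselves.
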
Thanks & Regards, Rohin. On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas wrote: > On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar > wrote: > >> *update* >> >> May be it doesn't have to be done at the tree creation level. It could be >> using loops and creating two different balltrees. Something like >> >> tree1=BallTree(X,metric='metric1') #for x-z plane >> tree2=BallTree(X,metric='metric2') #for y-z plane >> >> And then calculate correlation functions in a loop to get tpcf(X,r1,r2) >> using tree1.two_point_correlation(X,r1) and tree2.two_point_correlation( >> X,r2) >> > > Hi Rohin, > It's not exactly clear to me what you wish the tree to do with the two > different metrics, but in any case the ball tree only supports one metric > at a time. If you can construct your desired result from two ball trees > each with its own metric, then that's probably the best way to proceed, > Jake > > >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jeremiah.Johnson at unh.edu Tue Aug 1 12:03:01 2017 From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah) Date: Tue, 1 Aug 2017 16:03:01 +0000 Subject: [scikit-learn] question about class_weights in LogisticRegression Message-ID: Hello all, I'm looking for confirmation on an implementation detail that is somewhere in liblinear, but I haven't found documentation for yet. When the class_weights='balanced' parameter is set in LogisticRegression, then the regularisation parameter for an observation from class I is class_weight[I] * C, where C is the usual regularization parameter - is this correct? Thanks, Jeremiah -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Tue Aug 1 12:19:54 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 1 Aug 2017 09:19:54 -0700 Subject: [scikit-learn] question about class_weights in LogisticRegression In-Reply-To: References: Message-ID: I hope not. And not accoring to the docs... https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/linear_model/logistic.py#L947 class_weight : dict or 'balanced', optional Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. On Tue, Aug 1, 2017 at 9:03 AM, Johnson, Jeremiah wrote: > Hello all, > > I?m looking for confirmation on an implementation detail that is somewhere > in liblinear, but I haven?t found documentation for yet. When the > class_weights=?balanced? parameter is set in LogisticRegression, then the > regularisation parameter for an observation from class I is class_weight[I] > * C, where C is the usual regularization parameter ? is this correct? 
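As a side note on the "balanced" option quoted above: the weights themselves follow the docstring formula and can be inspected directly with the helper scikit-learn uses internally. The snippet below only reproduces that formula on made-up labels; it does not settle the separate question of how liblinear applies those weights to C internally.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    y = np.array([0] * 90 + [1] * 10)            # toy imbalanced labels
    classes = np.unique(y)

    # The docstring formula: n_samples / (n_classes * np.bincount(y))
    manual = float(len(y)) / (len(classes) * np.bincount(y))
    # The helper used for class_weight='balanced'
    auto = compute_class_weight('balanced', classes=classes, y=y)

    print(manual)   # [0.5555...  5.0] -> the minority class gets the larger weight
    print(auto)     # same values

As the docstring notes, these per-class weights are then multiplied with any sample_weight passed to fit.
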
> > Thanks, > Jeremiah > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From Jeremiah.Johnson at unh.edu Tue Aug 1 12:30:22 2017 From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah) Date: Tue, 1 Aug 2017 16:30:22 +0000 Subject: [scikit-learn] question about class_weights in LogisticRegression In-Reply-To: References: Message-ID: Right, I know how the class_weight calculation is performed. But then those class weights are utilized during the model fit process in some way in liblinear, and that?s what I am interested in. libSVM does class_weight[I] * C (https://www.csie.ntu.edu.tw/~cjlin/libsvm/); is the implementation in liblinear the same? Best, Jeremiah On 8/1/17, 12:19 PM, "scikit-learn on behalf of Stuart Reynolds" wrote: >I hope not. And not accoring to the docs... >https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_scikit-2Dl >earn_scikit-2Dlearn_blob_ab93d65_sklearn_linear-5Fmodel_logistic.py-23L947 >&d=DwIGaQ&c=c6MrceVCY5m5A_KAUkrdoA&r=hQNTLb4Jonm4n54VBW80WEzIAaqvTOcTEjhIk >rRJWXo&m=2XR2z3VWvEaERt4miGabDte3xkz_FwzMKMwnvEOWj8o&s=4uJZS3EaQgysmQlzjt- >yuLkSlcXTd5G50LkEFMcbZLQ&e= > >class_weight : dict or 'balanced', optional >Weights associated with classes in the form ``{class_label: weight}``. >If not given, all classes are supposed to have weight one. >The "balanced" mode uses the values of y to automatically adjust >weights inversely proportional to class frequencies in the input data >as ``n_samples / (n_classes * np.bincount(y))``. >Note that these weights will be multiplied with sample_weight (passed >through the fit method) if sample_weight is specified. > >On Tue, Aug 1, 2017 at 9:03 AM, Johnson, Jeremiah > wrote: >> Hello all, >> >> I?m looking for confirmation on an implementation detail that is >>somewhere >> in liblinear, but I haven?t found documentation for yet. When the >> class_weights=?balanced? parameter is set in LogisticRegression, then >>the >> regularisation parameter for an observation from class I is >>class_weight[I] >> * C, where C is the usual regularization parameter ? is this correct? >> >> Thanks, >> Jeremiah >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> >>https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.python.org_mail >>man_listinfo_scikit-2Dlearn&d=DwIGaQ&c=c6MrceVCY5m5A_KAUkrdoA&r=hQNTLb4Jo >>nm4n54VBW80WEzIAaqvTOcTEjhIkrRJWXo&m=2XR2z3VWvEaERt4miGabDte3xkz_FwzMKMwn >>vEOWj8o&s=MgZoI9VOHFh3omGKHTASFx3NAVjj6AY3j_75mnOUg04&e= >> >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.python.org_mailm >an_listinfo_scikit-2Dlearn&d=DwIGaQ&c=c6MrceVCY5m5A_KAUkrdoA&r=hQNTLb4Jonm >4n54VBW80WEzIAaqvTOcTEjhIkrRJWXo&m=2XR2z3VWvEaERt4miGabDte3xkz_FwzMKMwnvEO >Wj8o&s=MgZoI9VOHFh3omGKHTASFx3NAVjj6AY3j_75mnOUg04&e= From jakevdp at cs.washington.edu Tue Aug 1 13:25:52 2017 From: jakevdp at cs.washington.edu (Jacob Vanderplas) Date: Tue, 1 Aug 2017 10:25:52 -0700 Subject: [scikit-learn] Nearest neighbor search with 2 distance measures In-Reply-To: References: <379121501436421@mxfront4j.mail.yandex.net> Message-ID: Hi Rohin, Ah, I see. I don't think a BallTree is the right data structure for an anisotropic N-point query, because it fundamentally assumes spherical symmetry of the metric. 
You may be able to do something like this with a specialized KD-tree, but scikit-learn doesn't support this, and I don't imagine that it ever will given the very specialized nature of the application. I'm certain someone has written efficient code for this operation in the astronomy community, but I don't know of any good Python package to recommend for this ? I'd suggest googling for keywords and seeing where that gets you. Thanks, Jake Jake VanderPlas Senior Data Science Fellow Director of Open Software University of Washington eScience Institute On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar wrote: > Since you seem to be from Astrophysics/Cosmology background (I am assuming > you are jakevdp - the creator of astroML - if you are - I am lucky!), I can > explain my application scenario. I am trying to calculate the anisotropic > two-point correlation function something like done in rp_pi_tpcf > > or s_mu_tpcf > > using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation > > In halotools (http://halotools.readthedocs.io/en/latest/function_usage/ > mock_observables_functions.html) it is implemented using rectangular > grids. I could calculate 2pcf with custom metrics using one variable with > BallTree as done in astroML. I intend to find the anisotropic counter part. > > Thanks & Regards, > Rohin > > Y.Rohin Kumar, > +919818092877 <+91%2098180%2092877>. > > On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar wrote: > >> Dear Jake, >> >> Thanks for your response. I meant to group/count pairs in boxes (using >> two arrays simultaneously-hence needing 2 metrics) instead of one distance >> array as the binning parameter. I don't know if the algorithm supports such >> a thing. For now, I am proceeding with your suggestion of two ball trees at >> huge computational cost. I hope I am able to frame my question properly. >> >> Thanks & Regards, >> Rohin. >> >> >> >> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas < >> jakevdp at cs.washington.edu> wrote: >> >>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar >>> wrote: >>> >>>> *update* >>>> >>>> May be it doesn't have to be done at the tree creation level. It could >>>> be using loops and creating two different balltrees. Something like >>>> >>>> tree1=BallTree(X,metric='metric1') #for x-z plane >>>> tree2=BallTree(X,metric='metric2') #for y-z plane >>>> >>>> And then calculate correlation functions in a loop to get tpcf(X,r1,r2) >>>> using tree1.two_point_correlation(X,r1) and >>>> tree2.two_point_correlation(X,r2) >>>> >>> >>> Hi Rohin, >>> It's not exactly clear to me what you wish the tree to do with the two >>> different metrics, but in any case the ball tree only supports one metric >>> at a time. If you can construct your desired result from two ball trees >>> each with its own metric, then that's probably the best way to proceed, >>> Jake >>> >>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yrohinkumar at gmail.com Tue Aug 1 13:50:58 2017 From: yrohinkumar at gmail.com (Rohin Kumar) Date: Tue, 1 Aug 2017 23:20:58 +0530 Subject: [scikit-learn] Nearest neighbor search with 2 distance measures In-Reply-To: References: <379121501436421@mxfront4j.mail.yandex.net> Message-ID: Dear Jake, Thank you for your prompt reply. I started with KD-tree but after realising it doesn't support custom metrics (I don't know the reason for this - would be nice feature) I shifted to BallTree and was looking for a 2 metric based categorisation. After looking around, the best I could find at most were brute-force methods written in python (had my own version too) or better optimised ones in C or FORTRAN. The closest one was halotools which again works with euclidean metric. For now, I will try to get my work done with 2 different BallTrees iteratively in bins. If I find a better option will try to post an update. Regards, Rohin. On Tue, Aug 1, 2017 at 10:55 PM, Jacob Vanderplas wrote: > Hi Rohin, > Ah, I see. I don't think a BallTree is the right data structure for an > anisotropic N-point query, because it fundamentally assumes spherical > symmetry of the metric. You may be able to do something like this with a > specialized KD-tree, but scikit-learn doesn't support this, and I don't > imagine that it ever will given the very specialized nature of the > application. > > I'm certain someone has written efficient code for this operation in the > astronomy community, but I don't know of any good Python package to > recommend for this ? I'd suggest googling for keywords and seeing where > that gets you. > > Thanks, > Jake > > Jake VanderPlas > Senior Data Science Fellow > Director of Open Software > University of Washington eScience Institute > > On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar wrote: > >> Since you seem to be from Astrophysics/Cosmology background (I am >> assuming you are jakevdp - the creator of astroML - if you are - I am >> lucky!), I can explain my application scenario. I am trying to calculate >> the anisotropic two-point correlation function something like done in >> rp_pi_tpcf >> >> or s_mu_tpcf >> >> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation >> >> In halotools (http://halotools.readthedocs.io/en/latest/function_usage/mo >> ck_observables_functions.html) it is implemented using rectangular >> grids. I could calculate 2pcf with custom metrics using one variable with >> BallTree as done in astroML. I intend to find the anisotropic counter part. >> >> Thanks & Regards, >> Rohin >> >> >> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar >> wrote: >> >>> Dear Jake, >>> >>> Thanks for your response. I meant to group/count pairs in boxes (using >>> two arrays simultaneously-hence needing 2 metrics) instead of one distance >>> array as the binning parameter. I don't know if the algorithm supports such >>> a thing. For now, I am proceeding with your suggestion of two ball trees at >>> huge computational cost. I hope I am able to frame my question properly. >>> >>> Thanks & Regards, >>> Rohin. >>> >>> >>> >>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas < >>> jakevdp at cs.washington.edu> wrote: >>> >>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar >>>> wrote: >>>> >>>>> *update* >>>>> >>>>> May be it doesn't have to be done at the tree creation level. It could >>>>> be using loops and creating two different balltrees. 
Something like >>>>> >>>>> tree1=BallTree(X,metric='metric1') #for x-z plane >>>>> tree2=BallTree(X,metric='metric2') #for y-z plane >>>>> >>>>> And then calculate correlation functions in a loop to get >>>>> tpcf(X,r1,r2) using tree1.two_point_correlation(X,r1) and >>>>> tree2.two_point_correlation(X,r2) >>>>> >>>> >>>> Hi Rohin, >>>> It's not exactly clear to me what you wish the tree to do with the two >>>> different metrics, but in any case the ball tree only supports one metric >>>> at a time. If you can construct your desired result from two ball trees >>>> each with its own metric, then that's probably the best way to proceed, >>>> Jake >>>> >>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakevdp at cs.washington.edu Tue Aug 1 13:59:21 2017 From: jakevdp at cs.washington.edu (Jacob Vanderplas) Date: Tue, 1 Aug 2017 10:59:21 -0700 Subject: [scikit-learn] Nearest neighbor search with 2 distance measures In-Reply-To: References: <379121501436421@mxfront4j.mail.yandex.net> Message-ID: On Tue, Aug 1, 2017 at 10:50 AM, Rohin Kumar wrote: > I started with KD-tree but after realising it doesn't support custom > metrics (I don't know the reason for this - would be nice feature) > The scikit-learn KD-tree doesn't support custom metrics because it utilizes relatively strong assumptions about the form of the metric when constructing the tree. The Ball Tree makes fewer assumptions, which is why it can support arbitrary metrics. It would in principal be possible to create a KD Tree that supports custom *axis-aligned* metrics, but again I think that would be too specialized for inclusion in scikit-learn. One project you might check out is cykdtree: https://pypi.python.org/pypi/cykdtree I'm not certain whether it supports the queries you need, but I would bet the team behind that would be willing to work toward these sorts of specialized queries if they don't already exist. Jake > I shifted to BallTree and was looking for a 2 metric based categorisation. > After looking around, the best I could find at most were brute-force > methods written in python (had my own version too) or better optimised ones > in C or FORTRAN. The closest one was halotools which again works with > euclidean metric. For now, I will try to get my work done with 2 different > BallTrees iteratively in bins. If I find a better option will try to post > an update. > > Regards, > Rohin. > > > On Tue, Aug 1, 2017 at 10:55 PM, Jacob Vanderplas < > jakevdp at cs.washington.edu> wrote: > >> Hi Rohin, >> Ah, I see. I don't think a BallTree is the right data structure for an >> anisotropic N-point query, because it fundamentally assumes spherical >> symmetry of the metric. 
You may be able to do something like this with a >> specialized KD-tree, but scikit-learn doesn't support this, and I don't >> imagine that it ever will given the very specialized nature of the >> application. >> >> I'm certain someone has written efficient code for this operation in the >> astronomy community, but I don't know of any good Python package to >> recommend for this ? I'd suggest googling for keywords and seeing where >> that gets you. >> >> Thanks, >> Jake >> >> Jake VanderPlas >> Senior Data Science Fellow >> Director of Open Software >> University of Washington eScience Institute >> >> On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar >> wrote: >> >>> Since you seem to be from Astrophysics/Cosmology background (I am >>> assuming you are jakevdp - the creator of astroML - if you are - I am >>> lucky!), I can explain my application scenario. I am trying to calculate >>> the anisotropic two-point correlation function something like done in >>> rp_pi_tpcf >>> >>> or s_mu_tpcf >>> >>> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation >>> >>> In halotools (http://halotools.readthedocs. >>> io/en/latest/function_usage/mock_observables_functions.html) it is >>> implemented using rectangular grids. I could calculate 2pcf with custom >>> metrics using one variable with BallTree as done in astroML. I intend to >>> find the anisotropic counter part. >>> >>> Thanks & Regards, >>> Rohin >>> >>> >>> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar >>> wrote: >>> >>>> Dear Jake, >>>> >>>> Thanks for your response. I meant to group/count pairs in boxes (using >>>> two arrays simultaneously-hence needing 2 metrics) instead of one distance >>>> array as the binning parameter. I don't know if the algorithm supports such >>>> a thing. For now, I am proceeding with your suggestion of two ball trees at >>>> huge computational cost. I hope I am able to frame my question properly. >>>> >>>> Thanks & Regards, >>>> Rohin. >>>> >>>> >>>> >>>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas < >>>> jakevdp at cs.washington.edu> wrote: >>>> >>>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar >>>>> wrote: >>>>> >>>>>> *update* >>>>>> >>>>>> May be it doesn't have to be done at the tree creation level. It >>>>>> could be using loops and creating two different balltrees. Something like >>>>>> >>>>>> tree1=BallTree(X,metric='metric1') #for x-z plane >>>>>> tree2=BallTree(X,metric='metric2') #for y-z plane >>>>>> >>>>>> And then calculate correlation functions in a loop to get >>>>>> tpcf(X,r1,r2) using tree1.two_point_correlation(X,r1) and >>>>>> tree2.two_point_correlation(X,r2) >>>>>> >>>>> >>>>> Hi Rohin, >>>>> It's not exactly clear to me what you wish the tree to do with the two >>>>> different metrics, but in any case the ball tree only supports one metric >>>>> at a time. 
If you can construct your desired result from two ball trees >>>>> each with its own metric, then that's probably the best way to proceed, >>>>> Jake >>>>> >>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sambarnett95 at gmail.com Wed Aug 2 08:38:50 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Wed, 2 Aug 2017 13:38:50 +0100 Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer Message-ID: Dear all, I have created a 2-step pipeline with a custom transformer followed by a simple SVC classifier, and I wish to run a grid-search over it. I am able to successfully create the transformer and the pipeline, and each of these elements work fine. However, when I try to use the fit() method on my GridSearchCV object, I get the following error: 57 # during fit. 58 if X.shape != self.input_shape_: ---> 59 raise ValueError('Shape of input is different from what was seen ' 60 'in `fit`') 61 ValueError: Shape of input is different from what was seen in `fit` For a full breakdown of the problem, I have written a Jupyter notebook showing exactly how the error occurs (this also contains all .py files necessary to run the notebook). Can anybody see how to work through this? Many thanks, Sam Barnett -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Sequential Kernel Test.zip Type: application/zip Size: 6759 bytes Desc: not available URL: From viewsonic234 at aim.com Wed Aug 2 11:36:24 2017 From: viewsonic234 at aim.com (Chris Carrion) Date: Wed, 2 Aug 2017 11:36:24 -0400 Subject: [scikit-learn] minibatchkmeans deprecation warning? Message-ID: <3xMy9f2YqXzFqm1@mail.python.org> Hi, I?m working in an environment provided by Quantopian, an algorithmic-traders hub for research. I imported the minibatch kmeans from sklearn.clusters in the environment they provided, but I?m getting a deprecation warning. After reaching out to Quantopian support, they claim it?s something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not. Curious, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 2 12:05:17 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 2 Aug 2017 12:05:17 -0400 Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer In-Reply-To: References: Message-ID: Hi Sam. 
GridSearchCV will do cross-validation, which requires to "transform" the test data. The shape of the test-data will be different from the shape of the training data. You need to have the ability to compute the kernel between the training data and new test data. A more hacky solution would be to compute the full kernel matrix in advance and pass that to GridSearchCV. You probably don't need it here, but you should also checkout what the _pairwise attribute does in cross-validation, because that it likely to come up when playing with kernels. Hth, Andy On 08/02/2017 08:38 AM, Sam Barnett wrote: > Dear all, > > I have created a 2-step pipeline with a custom transformer followed by > a simple SVC classifier, and I wish to run a grid-search over it. I am > able to successfully create the transformer and the pipeline, and each > of these elements work fine. However, when I try to use the fit() > method on my GridSearchCV object, I get the following error: > > 57 # during fit. > 58 if X.shape != self.input_shape_: > ---> 59 raise ValueError('Shape of input is different from > what was seen ' > 60 'in `fit`') > 61 > > ValueError: Shape of input is different from what was seen in `fit` > > For a full breakdown of the problem, I have written a Jupyter notebook > showing exactly how the error occurs (this also contains all .py files > necessary to run the notebook). Can anybody see how to work through this? > > Many thanks, > Sam Barnett > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 2 12:05:44 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 2 Aug 2017 12:05:44 -0400 Subject: [scikit-learn] minibatchkmeans deprecation warning? In-Reply-To: <3xMy9f2YqXzFqm1@mail.python.org> References: <3xMy9f2YqXzFqm1@mail.python.org> Message-ID: Hi Chris. What is the warning? Andy On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote: > > Hi, > > I?m working in an environment provided by Quantopian, an > algorithmic-traders hub for research. I imported the minibatch kmeans > from sklearn.clusters in the environment they provided, but I?m > getting a deprecation warning. After reaching out to Quantopian > support, they claim it?s something with the way sklearn is coded, and > nothing can be done on their end. I was wondering whether this was > true or not. > > Curious, > > Chris > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From viewsonic234 at aim.com Wed Aug 2 12:10:30 2017 From: viewsonic234 at aim.com (Chris Carrion) Date: Wed, 2 Aug 2017 12:10:30 -0400 Subject: [scikit-learn] minibatchkmeans deprecation warning? In-Reply-To: References: <3xMy9f2YqXzFqm1@mail.python.org> Message-ID: <3xMypN0FjFzFqw2@mail.python.org> Hi Andy, WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This function is deprecated. Please call randint(0, 179 + 1) instead That?s all I?m given From: Andreas Mueller Sent: Wednesday, August 2, 2017 12:09 PM To: Chris Carrion via scikit-learn Subject: Re: [scikit-learn] minibatchkmeans deprecation warning? Hi Chris. What is the warning? Andy On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote: Hi, ? 
I'm working in an environment provided by Quantopian, an algorithmic-traders hub for research. I imported the minibatch kmeans from sklearn.clusters in the environment they provided, but I'm getting a deprecation warning. After reaching out to Quantopian support, they claim it's something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not. Curious, Chris _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 2 12:32:03 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 2 Aug 2017 12:32:03 -0400 Subject: [scikit-learn] minibatchkmeans deprecation warning? In-Reply-To: <3xMypN0FjFzFqw2@mail.python.org> References: <3xMy9f2YqXzFqm1@mail.python.org> <3xMypN0FjFzFqw2@mail.python.org> Message-ID: <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com> Ah. That's actually a deprecation warning coming from numpy, and I think it'll be removed in 0.19 (if not already in 0.18.1). It's really nothing to worry about, though. 
Andy On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote: Hi Andy, WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This function is deprecated. Please call randint(0, 179 + 1) instead ? That?s all I?m given From: Andreas Mueller Sent: Wednesday, August 2, 2017 12:09 PM To: Chris Carrion via scikit-learn Subject: Re: [scikit-learn] minibatchkmeans deprecation warning? ? Hi Chris. What is the warning? Andy On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote: Hi, ? I?m working in an environment provided by Quantopian, an algorithmic-traders hub for research. I imported the minibatch kmeans from sklearn.clusters in the environment they provided, but I?m getting a deprecation warning. After reaching out to Quantopian support, they claim it?s something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not. ? Curious, Chris _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn ? ? _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From viewsonic234 at aim.com Wed Aug 2 12:48:06 2017 From: viewsonic234 at aim.com (Chris Carrion) Date: Wed, 2 Aug 2017 12:48:06 -0400 Subject: [scikit-learn] minibatchkmeans deprecation warning? In-Reply-To: <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com> References: <3xMy9f2YqXzFqm1@mail.python.org> <3xMypN0FjFzFqw2@mail.python.org> <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com> Message-ID: <3xMzdn0TcWzFqVr@mail.python.org> Before I forget, is there an ETA for .19, or an average time between upgrades? From: Andreas Mueller Sent: Wednesday, August 2, 2017 12:34 PM To: Chris Carrion via scikit-learn Subject: Re: [scikit-learn] minibatchkmeans deprecation warning? Ah. That's actually a deprecation warning coming from numpy, and it think it'll be removed in 0.19 (if not already in 0.18.1). It's really nothing to worry about, though. Andy On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote: Hi Andy, WARN sklearn/cluster/k_means_.py:1301: DeprecationWarning: This function is deprecated. Please call randint(0, 179 + 1) instead ? That?s all I?m given From: Andreas Mueller Sent: Wednesday, August 2, 2017 12:09 PM To: Chris Carrion via scikit-learn Subject: Re: [scikit-learn] minibatchkmeans deprecation warning? ? Hi Chris. What is the warning? Andy On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote: Hi, ? I?m working in an environment provided by Quantopian, an algorithmic-traders hub for research. I imported the minibatch kmeans from sklearn.clusters in the environment they provided, but I?m getting a deprecation warning. After reaching out to Quantopian support, they claim it?s something with the way sklearn is coded, and nothing can be done on their end. I was wondering whether this was true or not. ? Curious, Chris _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn ? ? _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Wed Aug 2 14:36:02 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 2 Aug 2017 14:36:02 -0400 Subject: [scikit-learn] minibatchkmeans deprecation warning? In-Reply-To: <3xMzdn0TcWzFqVr@mail.python.org> References: <3xMy9f2YqXzFqm1@mail.python.org> <3xMypN0FjFzFqw2@mail.python.org> <66043d0e-dce5-ebac-a100-31bc02760aa3@gmail.com> <3xMzdn0TcWzFqVr@mail.python.org> Message-ID: <31bc0362-f3b4-94af-b240-0a1d4bb9e7e0@gmail.com> The docs say 3 month, I think. Though it's been more like 8. 0.19 will come out in August. On 08/02/2017 12:48 PM, Chris Carrion via scikit-learn wrote: > > Before I forget, is there an ETA for .19, or an average time between > upgrades? > > *From: *Andreas Mueller > *Sent: *Wednesday, August 2, 2017 12:34 PM > *To: *Chris Carrion via scikit-learn > *Subject: *Re: [scikit-learn] minibatchkmeans deprecation warning? > > Ah. > That's actually a deprecation warning coming from numpy, and it think > it'll be removed in 0.19 (if not already in 0.18.1). > It's really nothing to worry about, though. > > Andy > > On 08/02/2017 12:10 PM, Chris Carrion via scikit-learn wrote: > > Hi Andy, > > WARNsklearn/cluster/k_means_.py:1301: DeprecationWarning: This > function is deprecated. Please call randint(0, 179 + 1) instead > > That?s all I?m given > > *From: *Andreas Mueller > *Sent: *Wednesday, August 2, 2017 12:09 PM > *To: *Chris Carrion via scikit-learn > *Subject: *Re: [scikit-learn] minibatchkmeans deprecation warning? > > Hi Chris. > > What is the warning? > > Andy > > On 08/02/2017 11:36 AM, Chris Carrion via scikit-learn wrote: > > Hi, > > I?m working in an environment provided by Quantopian, an > algorithmic-traders hub for research. I imported the minibatch > kmeans from sklearn.clusters in the environment they provided, > but I?m getting a deprecation warning. After reaching out to > Quantopian support, they claim it?s something with the way > sklearn is coded, and nothing can be done on their end. I was > wondering whether this was true or not. > > Curious, > > Chris > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From sambarnett95 at gmail.com Wed Aug 2 15:08:07 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Wed, 2 Aug 2017 20:08:07 +0100 Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer In-Reply-To: References: Message-ID: Hi Andy, The purpose of the transformer is to take an ordinary kernel (in this case I have taken 'rbf' as a default) and return a 'sequentialised' kernel using a few extra parameters. Hence, the transformer takes an ordinary data-target pair X, y as its input, and the fit_transform(X, y) method will output the Gram matrix for X that is associated with this sequentialised kernel. In the pipeline, this Gram matrix is passed into an SVC classifier with the kernel parameter set to 'precomputed'. Therefore, I do not think your hacky solution would be possible. 
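For concreteness, the non-hacky route (computing the kernel between new data and the training data) usually amounts to the transformer remembering the training set in fit() and returning the cross-kernel in transform(), so its output has shape (n_samples, n_train) both when fitting and when predicting. A minimal sketch, using a plain RBF kernel as a stand-in for the custom sequentialised kernel (all names are illustrative, not the attached code):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    class KernelTransformer(BaseEstimator, TransformerMixin):
        """Return the kernel between X and the data seen in fit()."""
        def __init__(self, gamma=1.0):
            self.gamma = gamma
        def fit(self, X, y=None):
            self.X_fit_ = X            # remember the training set
            return self
        def transform(self, X):
            # shape (n_samples, n_samples_fit): valid for train *and* test data
            return rbf_kernel(X, self.X_fit_, gamma=self.gamma)

    X = np.random.RandomState(0).rand(60, 5)
    y = (X[:, 0] > 0.5).astype(int)

    pipe = Pipeline([('kern', KernelTransformer()),
                     ('svc', SVC(kernel='precomputed'))])
    grid = GridSearchCV(pipe, {'kern__gamma': [0.1, 1.0, 10.0],
                               'svc__C': [0.1, 1.0, 10.0]})
    grid.fit(X, y)

Because the transform output always has one column per *training* sample, its shape stays consistent between fit and predict, which is what the original ValueError was complaining about.
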
However, I am still unsure how to implement your first solution: won't the Gram matrix from the transformer contain all the necessary kernel values? Could you elaborate further? Best, Sam On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller wrote: > Hi Sam. > GridSearchCV will do cross-validation, which requires to "transform" the > test data. > The shape of the test-data will be different from the shape of the > training data. > You need to have the ability to compute the kernel between the training > data and new test data. > > A more hacky solution would be to compute the full kernel matrix in > advance and pass that to GridSearchCV. > > You probably don't need it here, but you should also checkout what the > _pairwise attribute does in cross-validation, > because that it likely to come up when playing with kernels. > > Hth, > Andy > > > On 08/02/2017 08:38 AM, Sam Barnett wrote: > > Dear all, > > I have created a 2-step pipeline with a custom transformer followed by a > simple SVC classifier, and I wish to run a grid-search over it. I am able > to successfully create the transformer and the pipeline, and each of these > elements work fine. However, when I try to use the fit() method on my > GridSearchCV object, I get the following error: > > 57 # during fit. > 58 if X.shape != self.input_shape_: > ---> 59 raise ValueError('Shape of input is different from > what was seen ' > 60 'in `fit`') > 61 > > ValueError: Shape of input is different from what was seen in `fit` > > For a full breakdown of the problem, I have written a Jupyter notebook > showing exactly how the error occurs (this also contains all .py files > necessary to run the notebook). Can anybody see how to work through this? > > Many thanks, > Sam Barnett > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pybokeh at gmail.com Wed Aug 2 22:01:36 2017 From: pybokeh at gmail.com (pybokeh) Date: Wed, 2 Aug 2017 22:01:36 -0400 Subject: [scikit-learn] Help With Text Classification Message-ID: Hello, I am studying this example from scikit-learn's site: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_ data.html The problem that I need to solve is very similar to this example, except I have one additional feature column (part #) that is categorical of type string. My label or target values consist of just 2 values: 0 or 1. With that additional feature column, I am transforming it with a LabelEncoder and then I am encoding it with the OneHotEncoder. Then I am concatenating that one-hot encoded column (part #) to the text/document feature column (complaint), which I had applied the CountVectorizer and TfidfTransformer transformations. Then I chose the MultinomialNB model to fit my concatenated training data with. The problem I run into is when I invoke the prediction, I get a dimension mis-match error. Here's my jupyter notebook gist: http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85ef86ba41424b311 I would gladly appreciate it if someone can guide me where I went wrong. Thanks! - Daniel -------------- next part -------------- An HTML attachment was scrubbed... 
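One common way to avoid the dimension mismatch described in this question is to keep both columns inside a single Pipeline, so exactly the same fitted vectorizers are reused at prediction time (transform only, never a second fit_transform). A rough sketch, assuming a pandas DataFrame with 'complaint' and 'part_no' columns; the small selector/dict helpers are illustrative, not part of scikit-learn:

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB

    class ColumnSelector(BaseEstimator, TransformerMixin):
        """Pick a single column out of a DataFrame (illustrative helper)."""
        def __init__(self, column):
            self.column = column
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X[self.column]

    class AsDictRecords(BaseEstimator, TransformerMixin):
        """Wrap a categorical column as dicts so DictVectorizer can one-hot it."""
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return [{'part_no': value} for value in X]

    text_branch = Pipeline([('select', ColumnSelector('complaint')),
                            ('tfidf', TfidfVectorizer())])
    part_branch = Pipeline([('select', ColumnSelector('part_no')),
                            ('as_dict', AsDictRecords()),
                            ('onehot', DictVectorizer())])

    model = Pipeline([('features', FeatureUnion([('text', text_branch),
                                                 ('part', part_branch)])),
                      ('clf', MultinomialNB())])

    # Made-up rows with the two feature columns described in the question.
    df = pd.DataFrame({'complaint': ['engine makes noise', 'door rattles',
                                     'engine stalls', 'paint is chipping'],
                       'part_no': ['P123', 'P456', 'P123', 'P789']})
    labels = [1, 0, 1, 0]

    model.fit(df, labels)
    print(model.predict(df))   # the same fitted transformers are reused here
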
URL: From joel.nothman at gmail.com Wed Aug 2 22:38:34 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 3 Aug 2017 12:38:34 +1000 Subject: [scikit-learn] Help With Text Classification In-Reply-To: References: Message-ID: Use a Pipeline to help avoid this kind of issue (and others). You might also want to do something like http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html On 3 August 2017 at 12:01, pybokeh wrote: > Hello, > I am studying this example from scikit-learn's site: > http://scikit-learn.org/stable/tutorial/text_analytics/ > working_with_text_data.html > > The problem that I need to solve is very similar to this example, except I > have one > additional feature column (part #) that is categorical of type string. My > label or target > values consist of just 2 values: 0 or 1. > > With that additional feature column, I am transforming it with a > LabelEncoder and > then I am encoding it with the OneHotEncoder. > > Then I am concatenating that one-hot encoded column (part #) to the > text/document > feature column (complaint), which I had applied the CountVectorizer and > TfidfTransformer transformations. > > Then I chose the MultinomialNB model to fit my concatenated training data > with. > > The problem I run into is when I invoke the prediction, I get a dimension > mis-match error. > > Here's my jupyter notebook gist: > http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85 > ef86ba41424b311 > > I would gladly appreciate it if someone can guide me where I went wrong. > Thanks! > > - Daniel > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pybokeh at gmail.com Wed Aug 2 23:12:36 2017 From: pybokeh at gmail.com (pybokeh) Date: Wed, 2 Aug 2017 23:12:36 -0400 Subject: [scikit-learn] Help With Text Classification In-Reply-To: References: Message-ID: Thanks Joel for recommending FeatureUnion. I did run across that. But for just 2 features, I thought that might be overkill. I am aware of Pipeline which the scikit-learn example explains very well, which I was going to utilize once I finalize my script. I did not want to abstract away too much early on since I am in the beginning stages of learning machine learning and scikit-learn. - Daniel On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman wrote: > Use a Pipeline to help avoid this kind of issue (and others). You might > also want to do something like http://scikit-learn.org/ > stable/auto_examples/hetero_feature_union.html > > On 3 August 2017 at 12:01, pybokeh wrote: > >> Hello, >> I am studying this example from scikit-learn's site: >> http://scikit-learn.org/stable/tutorial/text_analytics/worki >> ng_with_text_data.html >> >> The problem that I need to solve is very similar to this example, except >> I have one >> additional feature column (part #) that is categorical of type string. >> My label or target >> values consist of just 2 values: 0 or 1. >> >> With that additional feature column, I am transforming it with a >> LabelEncoder and >> then I am encoding it with the OneHotEncoder. >> >> Then I am concatenating that one-hot encoded column (part #) to the >> text/document >> feature column (complaint), which I had applied the CountVectorizer and >> TfidfTransformer transformations. >> >> Then I chose the MultinomialNB model to fit my concatenated training data >> with. 
>> >> The problem I run into is when I invoke the prediction, I get a dimension >> mis-match error. >> >> Here's my jupyter notebook gist: >> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85 >> ef86ba41424b311 >> >> I would gladly appreciate it if someone can guide me where I went wrong. >> Thanks! >> >> - Daniel >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yrohinkumar at gmail.com Wed Aug 2 23:42:58 2017 From: yrohinkumar at gmail.com (Rohin Kumar) Date: Thu, 3 Aug 2017 09:12:58 +0530 Subject: [scikit-learn] Nearest neighbor search with 2 distance measures In-Reply-To: References: <379121501436421@mxfront4j.mail.yandex.net> Message-ID: Dear Jake, Thank you for your inputs. Had a look at cykdtree. Core implementation of the algorithm is in C/C++ modifying which is currently beyond my skill. Will try to contact their team if they entertain special requests. I should be able fork and modify the sklearn algorithm in cython once my current project is complete. Currently going ahead with brute-force method. For now, this thread may be considered closed. Thanks once again! Regards, Rohin. On Tue, Aug 1, 2017 at 11:29 PM, Jacob Vanderplas wrote: > On Tue, Aug 1, 2017 at 10:50 AM, Rohin Kumar > wrote: > >> I started with KD-tree but after realising it doesn't support custom >> metrics (I don't know the reason for this - would be nice feature) >> > > The scikit-learn KD-tree doesn't support custom metrics because it > utilizes relatively strong assumptions about the form of the metric when > constructing the tree. The Ball Tree makes fewer assumptions, which is why > it can support arbitrary metrics. It would in principal be possible to > create a KD Tree that supports custom *axis-aligned* metrics, but again I > think that would be too specialized for inclusion in scikit-learn. > > One project you might check out is cykdtree: https://pypi.python. > org/pypi/cykdtree > I'm not certain whether it supports the queries you need, but I would bet > the team behind that would be willing to work toward these sorts of > specialized queries if they don't already exist. > > Jake > > > > >> I shifted to BallTree and was looking for a 2 metric based >> categorisation. After looking around, the best I could find at most were >> brute-force methods written in python (had my own version too) or better >> optimised ones in C or FORTRAN. The closest one was halotools which again >> works with euclidean metric. For now, I will try to get my work done with 2 >> different BallTrees iteratively in bins. If I find a better option will try >> to post an update. >> >> Regards, >> Rohin. >> >> >> On Tue, Aug 1, 2017 at 10:55 PM, Jacob Vanderplas < >> jakevdp at cs.washington.edu> wrote: >> >>> Hi Rohin, >>> Ah, I see. I don't think a BallTree is the right data structure for an >>> anisotropic N-point query, because it fundamentally assumes spherical >>> symmetry of the metric. You may be able to do something like this with a >>> specialized KD-tree, but scikit-learn doesn't support this, and I don't >>> imagine that it ever will given the very specialized nature of the >>> application. 
>>> >>> I'm certain someone has written efficient code for this operation in the >>> astronomy community, but I don't know of any good Python package to >>> recommend for this ? I'd suggest googling for keywords and seeing where >>> that gets you. >>> >>> Thanks, >>> Jake >>> >>> Jake VanderPlas >>> Senior Data Science Fellow >>> Director of Open Software >>> University of Washington eScience Institute >>> >>> On Tue, Aug 1, 2017 at 6:15 AM, Rohin Kumar >>> wrote: >>> >>>> Since you seem to be from Astrophysics/Cosmology background (I am >>>> assuming you are jakevdp - the creator of astroML - if you are - I am >>>> lucky!), I can explain my application scenario. I am trying to calculate >>>> the anisotropic two-point correlation function something like done in >>>> rp_pi_tpcf >>>> >>>> or s_mu_tpcf >>>> >>>> using pairs (DD,DR,RR) calculated from BallTree.two_point_correlation >>>> >>>> In halotools (http://halotools.readthedocs. >>>> io/en/latest/function_usage/mock_observables_functions.html) it is >>>> implemented using rectangular grids. I could calculate 2pcf with custom >>>> metrics using one variable with BallTree as done in astroML. I intend to >>>> find the anisotropic counter part. >>>> >>>> Thanks & Regards, >>>> Rohin >>>> >>>> >>>> On Tue, Aug 1, 2017 at 5:18 PM, Rohin Kumar >>>> wrote: >>>> >>>>> Dear Jake, >>>>> >>>>> Thanks for your response. I meant to group/count pairs in boxes (using >>>>> two arrays simultaneously-hence needing 2 metrics) instead of one distance >>>>> array as the binning parameter. I don't know if the algorithm supports such >>>>> a thing. For now, I am proceeding with your suggestion of two ball trees at >>>>> huge computational cost. I hope I am able to frame my question properly. >>>>> >>>>> Thanks & Regards, >>>>> Rohin. >>>>> >>>>> >>>>> >>>>> On Mon, Jul 31, 2017 at 8:16 PM, Jacob Vanderplas < >>>>> jakevdp at cs.washington.edu> wrote: >>>>> >>>>>> On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar >>>>>> wrote: >>>>>> >>>>>>> *update* >>>>>>> >>>>>>> May be it doesn't have to be done at the tree creation level. It >>>>>>> could be using loops and creating two different balltrees. Something like >>>>>>> >>>>>>> tree1=BallTree(X,metric='metric1') #for x-z plane >>>>>>> tree2=BallTree(X,metric='metric2') #for y-z plane >>>>>>> >>>>>>> And then calculate correlation functions in a loop to get >>>>>>> tpcf(X,r1,r2) using tree1.two_point_correlation(X,r1) and >>>>>>> tree2.two_point_correlation(X,r2) >>>>>>> >>>>>> >>>>>> Hi Rohin, >>>>>> It's not exactly clear to me what you wish the tree to do with the >>>>>> two different metrics, but in any case the ball tree only supports one >>>>>> metric at a time. 
If you can construct your desired result from two ball >>>>>> trees each with its own metric, then that's probably the best way to >>>>>> proceed, >>>>>> Jake >>>>>> >>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Aug 3 00:54:18 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 3 Aug 2017 14:54:18 +1000 Subject: [scikit-learn] Help With Text Classification In-Reply-To: References: Message-ID: One of the key advantages of Pipeline is that it makes sure that equivalent processing happens at training and prediction time (assuming you do not write your own transformers that break their contract). This is what appears to have broken in your current attempts. On 3 August 2017 at 13:12, pybokeh wrote: > Thanks Joel for recommending FeatureUnion. I did run across that. But > for just 2 features, I thought that might be overkill. I am aware of > Pipeline which the scikit-learn example explains very well, which I was > going to utilize once I finalize my script. I did not want to abstract > away too much early on since I am in the beginning stages of learning > machine learning and scikit-learn. > > - Daniel > > On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman > wrote: > >> Use a Pipeline to help avoid this kind of issue (and others). You might >> also want to do something like http://scikit-learn.org/stable >> /auto_examples/hetero_feature_union.html >> >> On 3 August 2017 at 12:01, pybokeh wrote: >> >>> Hello, >>> I am studying this example from scikit-learn's site: >>> http://scikit-learn.org/stable/tutorial/text_analytics/worki >>> ng_with_text_data.html >>> >>> The problem that I need to solve is very similar to this example, except >>> I have one >>> additional feature column (part #) that is categorical of type string. >>> My label or target >>> values consist of just 2 values: 0 or 1. >>> >>> With that additional feature column, I am transforming it with a >>> LabelEncoder and >>> then I am encoding it with the OneHotEncoder. >>> >>> Then I am concatenating that one-hot encoded column (part #) to the >>> text/document >>> feature column (complaint), which I had applied the CountVectorizer and >>> TfidfTransformer transformations. >>> >>> Then I chose the MultinomialNB model to fit my concatenated training >>> data with. 
>>> >>> The problem I run into is when I invoke the prediction, I get a >>> dimension mis-match error. >>> >>> Here's my jupyter notebook gist: >>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85 >>> ef86ba41424b311 >>> >>> I would gladly appreciate it if someone can guide me where I went >>> wrong. Thanks! >>> >>> - Daniel >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishekraj10 at yahoo.com Thu Aug 3 06:15:50 2017 From: abhishekraj10 at yahoo.com (Abhishek Raj) Date: Thu, 3 Aug 2017 15:45:50 +0530 Subject: [scikit-learn] OneClassSvm | Different results on different runs Message-ID: Hi, I am using one class svm for developing an anomaly detection model. I observed that different runs of training on the same data set outputs different accuracy. One run takes the accuracy as high as 98% and another run on the same data brings it down to 93%. Googling a little bit I found out that this is happening because of the random_state parameter but I am not clear of the details. Can anyone expand on how is the parameter exactly affecting my training and how I can figure out the best value to get the model with best accuracy? Thanks, Abhishek -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaquesgrobler at gmail.com Thu Aug 3 06:39:44 2017 From: jaquesgrobler at gmail.com (Jaques Grobler) Date: Thu, 3 Aug 2017 12:39:44 +0200 Subject: [scikit-learn] OneClassSvm | Different results on different runs In-Reply-To: References: Message-ID: Hi, The random_state parameter is used to generate a pseudo random number that is used when shuffling your data for probability estimation The seed of the pseudo random number generator to use when shuffling the data for probability estimation. A seed can be provided to control the shuffling for reproducible behavior. Also, from the SVM docs The underlying LinearSVC > implementation > uses a random number generator to select features when fitting the model. > It is thus not uncommon, to have slightly different results for the same > input data. If that happens, try with a smaller *tol *parameter. Hope that helps 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn < scikit-learn at python.org>: > Hi, > > I am using one class svm for developing an anomaly detection model. I > observed that different runs of training on the same data set outputs > different accuracy. One run takes the accuracy as high as 98% and another > run on the same data brings it down to 93%. Googling a little bit I found > out that this is happening because of the random_state > parameter > but I am not clear of the details. > > Can anyone expand on how is the parameter exactly affecting my training > and how I can figure out the best value to get the model with best accuracy? 
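To make the reproducibility point concrete: in the scikit-learn version discussed in this thread OneClassSVM still exposes a random_state parameter, and fixing it (rather than searching over it) is what makes repeated fits comparable. A small sketch on synthetic data, where nu and gamma are arbitrary illustrative values:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.randn(500, 2)                             # toy "normal" observations
X_test = np.vstack([rng.randn(200, 2),                  # mostly normal ...
                    rng.uniform(-6, 6, size=(20, 2))])  # ... plus a few outliers

# Fix the seed so repeated runs are identical; the seed itself is not a
# quantity worth tuning for accuracy.
for seed in (0, 1, 2):
    clf = OneClassSVM(nu=0.1, gamma=0.1, tol=1e-4, random_state=seed)
    clf.fit(X_train)
    pred = clf.predict(X_test)                          # +1 inlier, -1 outlier
    print(seed, (pred == 1).mean())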
> > Thanks, > Abhishek > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From albertthomas88 at gmail.com Thu Aug 3 07:26:17 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Thu, 03 Aug 2017 11:26:17 +0000 Subject: [scikit-learn] OneClassSvm | Different results on different runs In-Reply-To: References: Message-ID: Hi Abhishek, Could you provide a small code snippet? I don't think the random_state parameter should influence the result of the OneClassSVM as there is no probability estimation for this estimator. Albert On Thu, Aug 3, 2017 at 12:41 PM Jaques Grobler wrote: > Hi, > > The random_state parameter is used to generate a pseudo random number that > is used when shuffling your data for probability estimation > > The seed of the pseudo random number generator to use when shuffling the > data for probability estimation. > A seed can be provided to control the shuffling for reproducible behavior. > > Also, from the SVM docs > > > The underlying LinearSVC >> implementation >> uses a random number generator to select features when fitting the model. >> It is thus not uncommon, to have slightly different results for the same >> input data. If that happens, try with a smaller *tol *parameter. > > > Hope that helps > > 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn < > scikit-learn at python.org>: > >> Hi, >> >> I am using one class svm for developing an anomaly detection model. I >> observed that different runs of training on the same data set outputs >> different accuracy. One run takes the accuracy as high as 98% and another >> run on the same data brings it down to 93%. Googling a little bit I found >> out that this is happening because of the random_state >> parameter >> but I am not clear of the details. >> >> Can anyone expand on how is the parameter exactly affecting my training >> and how I can figure out the best value to get the model with best accuracy? >> >> Thanks, >> Abhishek >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 3 07:54:37 2017 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 3 Aug 2017 13:54:37 +0200 Subject: [scikit-learn] OneClassSvm | Different results on different runs In-Reply-To: References: Message-ID: @albertcthomas isn't there some randomness in SMO which could influence the result if the tolerance parameter is too large? On Aug 3, 2017 1:28 PM, "Albert Thomas" wrote: > Hi Abhishek, > > Could you provide a small code snippet? I don't think the random_state > parameter should influence the result of the OneClassSVM as there is no > probability estimation for this estimator. > > Albert > > On Thu, Aug 3, 2017 at 12:41 PM Jaques Grobler > wrote: > >> Hi, >> >> The random_state parameter is used to generate a pseudo random number >> that is used when shuffling your data for probability estimation >> >> The seed of the pseudo random number generator to use when shuffling the >> data for probability estimation. 
>> A seed can be provided to control the shuffling for reproducible behavior. >> >> Also, from the SVM docs >> >> >> The underlying LinearSVC >>> >>> implementation uses a random number generator to select features when >>> fitting the model. It is thus not uncommon, to have slightly different >>> results for the same input data. If that happens, try with a smaller *tol >>> *parameter. >> >> >> Hope that helps >> >> 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn < >> scikit-learn at python.org>: >> >>> Hi, >>> >>> I am using one class svm for developing an anomaly detection model. I >>> observed that different runs of training on the same data set outputs >>> different accuracy. One run takes the accuracy as high as 98% and another >>> run on the same data brings it down to 93%. Googling a little bit I found >>> out that this is happening because of the random_state >>> parameter >>> but I am not clear of the details. >>> >>> Can anyone expand on how is the parameter exactly affecting my training >>> and how I can figure out the best value to get the model with best accuracy? >>> >>> Thanks, >>> Abhishek >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From albertthomas88 at gmail.com Thu Aug 3 09:17:38 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Thu, 03 Aug 2017 13:17:38 +0000 Subject: [scikit-learn] OneClassSvm | Different results on different runs In-Reply-To: References: Message-ID: Yes, in fact, changing the random_state might have an influence on the result. The docstring of the random_state parameter for the OneClassSVM seems incorrect though... Albert On Thu, Aug 3, 2017 at 1:55 PM Nicolas Goix wrote: > @albertcthomas isn't there some randomness in SMO which could influence > the result if the tolerance parameter is too large? > > On Aug 3, 2017 1:28 PM, "Albert Thomas" wrote: > >> Hi Abhishek, >> >> Could you provide a small code snippet? I don't think the random_state >> parameter should influence the result of the OneClassSVM as there is no >> probability estimation for this estimator. >> >> Albert >> >> On Thu, Aug 3, 2017 at 12:41 PM Jaques Grobler >> wrote: >> >>> Hi, >>> >>> The random_state parameter is used to generate a pseudo random number >>> that is used when shuffling your data for probability estimation >>> >>> The seed of the pseudo random number generator to use when shuffling the >>> data for probability estimation. >>> A seed can be provided to control the shuffling for reproducible >>> behavior. >>> >>> Also, from the SVM docs >>> >>> >>> The underlying LinearSVC >>>> implementation >>>> uses a random number generator to select features when fitting the model. >>>> It is thus not uncommon, to have slightly different results for the same >>>> input data. If that happens, try with a smaller *tol *parameter. 
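One way to check the quoted advice empirically is to fit the same data with two different seeds and compare the resulting decision functions at a loose and at a tight tolerance; again this assumes a scikit-learn version in which OneClassSVM accepts random_state:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(300, 5)

for tol in (1e-1, 1e-4):
    scores = [OneClassSVM(nu=0.2, gamma=0.2, tol=tol, random_state=seed)
              .fit(X).decision_function(X).ravel()
              for seed in (0, 1)]
    # Maximum disagreement between the two fits at this tolerance.
    print(tol, np.abs(scores[0] - scores[1]).max())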
>>> >>> >>> Hope that helps >>> >>> 2017-08-03 12:15 GMT+02:00 Abhishek Raj via scikit-learn < >>> scikit-learn at python.org>: >>> >>>> Hi, >>>> >>>> I am using one class svm for developing an anomaly detection model. I >>>> observed that different runs of training on the same data set outputs >>>> different accuracy. One run takes the accuracy as high as 98% and another >>>> run on the same data brings it down to 93%. Googling a little bit I found >>>> out that this is happening because of the random_state >>>> parameter >>>> but I am not clear of the details. >>>> >>>> Can anyone expand on how is the parameter exactly affecting my training >>>> and how I can figure out the best value to get the model with best accuracy? >>>> >>>> Thanks, >>>> Abhishek >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.waseem.ahmad at gmail.com Thu Aug 3 10:37:02 2017 From: m.waseem.ahmad at gmail.com (muhammad waseem) Date: Thu, 3 Aug 2017 15:37:02 +0100 Subject: [scikit-learn] Extra trees tuning parameters Message-ID: Hi All, I was wondering if you could please tell me what is the "nmin , the minimum sample size for splitting a node" (referred by Geurts et al., 2006) in scikit-learn API for Extra trees? Is it min_samples_split in skearn? Regards Waseem -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.duprelatour at orange.fr Thu Aug 3 11:18:17 2017 From: tom.duprelatour at orange.fr (Tom DLT) Date: Thu, 3 Aug 2017 17:18:17 +0200 Subject: [scikit-learn] question about class_weights in LogisticRegression In-Reply-To: References: Message-ID: The class weights and sample weights are used in the same way, as a factor specific to each sample, in the loss function. In LogisticRegression, it is equivalent to incorporate this factor into a regularization parameter C specific to each sample. Tom 2017-08-01 18:30 GMT+02:00 Johnson, Jeremiah : > Right, I know how the class_weight calculation is performed. But then > those class weights are utilized during the model fit process in some way > in liblinear, and that?s what I am interested in. libSVM does > class_weight[I] * C (https://www.csie.ntu.edu.tw/~cjlin/libsvm/); is the > implementation in liblinear the same? > > Best, > Jeremiah > > > > On 8/1/17, 12:19 PM, "scikit-learn on behalf of Stuart Reynolds" > stuart at stuartreynolds.net> wrote: > > >I hope not. And not accoring to the docs... 
> >https://urldefense.proofpoint.com/v2/url?u=https- > 3A__github.com_scikit-2Dl > >earn_scikit-2Dlearn_blob_ab93d65_sklearn_linear- > 5Fmodel_logistic.py-23L947 > >&d=DwIGaQ&c=c6MrceVCY5m5A_KAUkrdoA&r=hQNTLb4Jonm4n54VBW80WEzIAaqvTO > cTEjhIk > >rRJWXo&m=2XR2z3VWvEaERt4miGabDte3xkz_FwzMKMwnvEOWj8o&s= > 4uJZS3EaQgysmQlzjt- > >yuLkSlcXTd5G50LkEFMcbZLQ&e= > > > >class_weight : dict or 'balanced', optional > >Weights associated with classes in the form ``{class_label: weight}``. > >If not given, all classes are supposed to have weight one. > >The "balanced" mode uses the values of y to automatically adjust > >weights inversely proportional to class frequencies in the input data > >as ``n_samples / (n_classes * np.bincount(y))``. > >Note that these weights will be multiplied with sample_weight (passed > >through the fit method) if sample_weight is specified. > > > >On Tue, Aug 1, 2017 at 9:03 AM, Johnson, Jeremiah > > wrote: > >> Hello all, > >> > >> I?m looking for confirmation on an implementation detail that is > >>somewhere > >> in liblinear, but I haven?t found documentation for yet. When the > >> class_weights=?balanced? parameter is set in LogisticRegression, then > >>the > >> regularisation parameter for an observation from class I is > >>class_weight[I] > >> * C, where C is the usual regularization parameter ? is this correct? > >> > >> Thanks, > >> Jeremiah > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> > >>https://urldefense.proofpoint.com/v2/url?u=https- > 3A__mail.python.org_mail > >>man_listinfo_scikit-2Dlearn&d=DwIGaQ&c=c6MrceVCY5m5A_ > KAUkrdoA&r=hQNTLb4Jo > >>nm4n54VBW80WEzIAaqvTOcTEjhIkrRJWXo&m=2XR2z3VWvEaERt4miGabDte3xkz_ > FwzMKMwn > >>vEOWj8o&s=MgZoI9VOHFh3omGKHTASFx3NAVjj6AY3j_75mnOUg04&e= > >> > >_______________________________________________ > >scikit-learn mailing list > >scikit-learn at python.org > >https://urldefense.proofpoint.com/v2/url?u=https- > 3A__mail.python.org_mailm > >an_listinfo_scikit-2Dlearn&d=DwIGaQ&c=c6MrceVCY5m5A_ > KAUkrdoA&r=hQNTLb4Jonm > >4n54VBW80WEzIAaqvTOcTEjhIkrRJWXo&m=2XR2z3VWvEaERt4miGabDte3xkz_ > FwzMKMwnvEO > >Wj8o&s=MgZoI9VOHFh3omGKHTASFx3NAVjj6AY3j_75mnOUg04&e= > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Aug 3 12:12:12 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 3 Aug 2017 12:12:12 -0400 Subject: [scikit-learn] OneClassSvm | Different results on different runs In-Reply-To: References: Message-ID: On 08/03/2017 09:17 AM, Albert Thomas wrote: > Yes, in fact, changing the random_state might have an influence on the > result. The docstring of the random_state parameter for the > OneClassSVM seems incorrect though... PR or issue welcome. From t3kcit at gmail.com Thu Aug 3 13:35:46 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 3 Aug 2017 13:35:46 -0400 Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer In-Reply-To: References: Message-ID: Hi Sam. You need to put these into a reachable namespace (possibly as private functions) so that they can be pickled. Please stay on the sklearn mailing list, I might not have time to reply. 
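To illustrate the "reachable namespace" point: pickle stores functions by their import path, so a kernel helper defined at module level can be pickled, while one defined inside another function (for example inside fit) cannot. The helper names below are invented for the example and are not the code from the attached notebook:

import pickle
import numpy as np

def linear_gram(X, Y):
    # Module-level helper: picklable by reference.
    return np.dot(X, Y.T)

def make_nested_gram():
    def nested_gram(X, Y):
        # Defined inside another function: the standard pickle cannot find it.
        return np.dot(X, Y.T)
    return nested_gram

pickle.dumps(linear_gram)                 # works
try:
    pickle.dumps(make_nested_gram())
except (pickle.PicklingError, AttributeError) as exc:
    print("nested function cannot be pickled:", exc)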
Andy On 08/03/2017 01:24 PM, Sam Barnett wrote: > Hi Andy, > > I've since tried a different solution: instead of a pipeline, I've > simply created a classifier that is for the most part like svm.SVC, > though it takes a few extra inputs for the sequentialisation step. > I've used a Python function that can compute the Gram matrix between > two datasets of any shape to pass into SVC(), though I'm now having > trouble with pickling on the check_estimator test. It appears that > SeqSVC.fit() doesn't like to have methods defined within it. Can you > see how to pass this test? (the .ipynb file shows the error). > > Best, > Sam > > On Wed, Aug 2, 2017 at 9:44 PM, Sam Barnett > wrote: > > You're right: it does fail without GridSearchCV when I change the > size of seq_test. I will look at the transform tomorrow to see if > I can work this out. Thank you for your help so far! > > On Wed, Aug 2, 2017 at 9:20 PM, Andreas Mueller > wrote: > > Change the size of seq_test in your notebook and you'll see > the failure without GridSearchCV. > I haven't looked at your code in detail, but transform is > supposed to work on arbitrary new data with the same number of > features. > Your code requires the test data to have the same shape as the > training data. > Cross-validation will lead to training data and test data > having different sizes. But I feel like something is already > wrong if your > test data size depends on your training data size. > > > > On 08/02/2017 03:08 PM, Sam Barnett wrote: >> Hi Andy, >> >> The purpose of the transformer is to take an ordinary kernel >> (in this case I have taken 'rbf' as a default) and return a >> 'sequentialised' kernel using a few extra parameters. Hence, >> the transformer takes an ordinary data-target pair X, y as >> its input, and the fit_transform(X, y) method will output the >> Gram matrix for X that is associated with this sequentialised >> kernel. In the pipeline, this Gram matrix is passed into an >> SVC classifier with the kernel parameter set to 'precomputed'. >> >> Therefore, I do not think your hacky solution would be >> possible. However, I am still unsure how to implement your >> first solution: won't the Gram matrix from the transformer >> contain all the necessary kernel values? Could you elaborate >> further? >> >> >> Best, >> Sam >> >> On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller >> > wrote: >> >> Hi Sam. >> GridSearchCV will do cross-validation, which requires to >> "transform" the test data. >> The shape of the test-data will be different from the >> shape of the training data. >> You need to have the ability to compute the kernel >> between the training data and new test data. >> >> A more hacky solution would be to compute the full kernel >> matrix in advance and pass that to GridSearchCV. >> >> You probably don't need it here, but you should also >> checkout what the _pairwise attribute does in >> cross-validation, >> because that it likely to come up when playing with kernels. >> >> Hth, >> Andy >> >> >> On 08/02/2017 08:38 AM, Sam Barnett wrote: >>> Dear all, >>> >>> I have created a 2-step pipeline with a custom >>> transformer followed by a simple SVC classifier, and I >>> wish to run a grid-search over it. I am able to >>> successfully create the transformer and the pipeline, >>> and each of these elements work fine. However, when I >>> try to use the fit() method on my GridSearchCV object, I >>> get the following error: >>> >>> 57 # during fit. 
>>> 58 if X.shape != self.input_shape_: >>> ---> 59 raise ValueError('Shape of input is >>> different from what was seen ' >>> 60 'in `fit`') >>> 61 >>> >>> ValueError: Shape of input is different from what was >>> seen in `fit` >>> >>> For a full breakdown of the problem, I have written a >>> Jupyter notebook showing exactly how the error occurs >>> (this also contains all .py files necessary to run the >>> notebook). Can anybody see how to work through this? >>> >>> Many thanks, >>> Sam Barnett >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pybokeh at gmail.com Thu Aug 3 17:48:26 2017 From: pybokeh at gmail.com (pybokeh) Date: Thu, 3 Aug 2017 17:48:26 -0400 Subject: [scikit-learn] Help With Text Classification In-Reply-To: References: Message-ID: I found my problem. When I one-hot encoded my test part #, it resulted in being a 1x1 matrix, when I need it to be a 1x153. This happened because I used the default setting ('auto') for n_values, when I needed it set it to 153. Now when I horizontally stacked it to my other feature matrix, the resulting total # of columns now correctly comes to 1294, instead of 1142. Looking back now, not sure if using Pipeline or using FeatureUnion would have helped in this case or prevented this since this error occurred on the prediction side, not on training or modeling side. On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman wrote: > Use a Pipeline to help avoid this kind of issue (and others). You might > also want to do something like http://scikit-learn.org/ > stable/auto_examples/hetero_feature_union.html > > On 3 August 2017 at 12:01, pybokeh wrote: > >> Hello, >> I am studying this example from scikit-learn's site: >> http://scikit-learn.org/stable/tutorial/text_analytics/worki >> ng_with_text_data.html >> >> The problem that I need to solve is very similar to this example, except >> I have one >> additional feature column (part #) that is categorical of type string. >> My label or target >> values consist of just 2 values: 0 or 1. >> >> With that additional feature column, I am transforming it with a >> LabelEncoder and >> then I am encoding it with the OneHotEncoder. >> >> Then I am concatenating that one-hot encoded column (part #) to the >> text/document >> feature column (complaint), which I had applied the CountVectorizer and >> TfidfTransformer transformations. >> >> Then I chose the MultinomialNB model to fit my concatenated training data >> with. >> >> The problem I run into is when I invoke the prediction, I get a dimension >> mis-match error. >> >> Here's my jupyter notebook gist: >> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85 >> ef86ba41424b311 >> >> I would gladly appreciate it if someone can guide me where I went wrong. >> Thanks! 
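The dimension mismatch described above is the usual symptom of re-fitting an encoder on the prediction data. A minimal sketch of the alternative fix, with hypothetical part numbers: fit LabelEncoder and OneHotEncoder on the training column only and call transform, not fit_transform, at prediction time, so the one-hot block always has the same width:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train_parts = np.array(['P-100', 'P-200', 'P-300', 'P-200'])
test_parts = np.array(['P-300'])

le = LabelEncoder()
ohe = OneHotEncoder()

# Fit both encoders on the training column ...
train_onehot = ohe.fit_transform(le.fit_transform(train_parts).reshape(-1, 1))

# ... and only transform at prediction time.
test_onehot = ohe.transform(le.transform(test_parts).reshape(-1, 1))

print(train_onehot.shape, test_onehot.shape)  # same number of columns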
>> >> - Daniel >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Aug 3 18:29:10 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 4 Aug 2017 08:29:10 +1000 Subject: [scikit-learn] Help With Text Classification In-Reply-To: References: Message-ID: pipeline helps in prediction time too. On 4 Aug 2017 7:49 am, "pybokeh" wrote: > I found my problem. When I one-hot encoded my test part #, it resulted in > being a 1x1 matrix, when I need it to be a 1x153. This happened because I > used the default setting ('auto') for n_values, when I needed it set it to > 153. Now when I horizontally stacked it to my other feature matrix, the > resulting total # of columns now correctly comes to 1294, instead of > 1142. Looking back now, not sure if using Pipeline or using FeatureUnion > would have helped in this case or prevented this since this error occurred > on the prediction side, not on training or modeling side. > > On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman > wrote: > >> Use a Pipeline to help avoid this kind of issue (and others). You might >> also want to do something like http://scikit-learn.org/stable >> /auto_examples/hetero_feature_union.html >> >> On 3 August 2017 at 12:01, pybokeh wrote: >> >>> Hello, >>> I am studying this example from scikit-learn's site: >>> http://scikit-learn.org/stable/tutorial/text_analytics/worki >>> ng_with_text_data.html >>> >>> The problem that I need to solve is very similar to this example, except >>> I have one >>> additional feature column (part #) that is categorical of type string. >>> My label or target >>> values consist of just 2 values: 0 or 1. >>> >>> With that additional feature column, I am transforming it with a >>> LabelEncoder and >>> then I am encoding it with the OneHotEncoder. >>> >>> Then I am concatenating that one-hot encoded column (part #) to the >>> text/document >>> feature column (complaint), which I had applied the CountVectorizer and >>> TfidfTransformer transformations. >>> >>> Then I chose the MultinomialNB model to fit my concatenated training >>> data with. >>> >>> The problem I run into is when I invoke the prediction, I get a >>> dimension mis-match error. >>> >>> Here's my jupyter notebook gist: >>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85 >>> ef86ba41424b311 >>> >>> I would gladly appreciate it if someone can guide me where I went >>> wrong. Thanks! >>> >>> - Daniel >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 
From sambarnett95 at gmail.com  Fri Aug 4 06:29:50 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Fri, 4 Aug 2017 11:29:50 +0100
Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer
Message-ID: 

Hi Andy,

I have since been able to resolve the pickling issue, though the
check_estimator test now fails because the error message raised by my
estimator before fitting does not include the expected string 'fit'. In
general, I am trying to use the fit() method of my classifier to
instantiate a separate SVC() classifier with a custom kernel, fit THAT to
the data, then return this instance as the fitted version of the new
classifier. Is this possible in theory? If so, what is the best way to
implement it?

As before, the requisite code and a .ipynb file are attached.

Best,
Sam

On Thu, Aug 3, 2017 at 6:35 PM, Andreas Mueller wrote:

> Hi Sam.
> You need to put these into a reachable namespace (possibly as private
> functions) so that they can be pickled.
> Please stay on the sklearn mailing list, I might not have time to reply.
>
> Andy
>
>
> On 08/03/2017 01:24 PM, Sam Barnett wrote:
>
> Hi Andy,
>
> I've since tried a different solution: instead of a pipeline, I've simply
> created a classifier that is for the most part like svm.SVC, though it
> takes a few extra inputs for the sequentialisation step. I've used a Python
> function that can compute the Gram matrix between two datasets of any shape
> to pass into SVC(), though I'm now having trouble with pickling on the
> check_estimator test. It appears that SeqSVC.fit() doesn't like to have
> methods defined within it. Can you see how to pass this test? (the .ipynb
> file shows the error).
>
> Best,
> Sam
>
> On Wed, Aug 2, 2017 at 9:44 PM, Sam Barnett wrote:
>
>> You're right: it does fail without GridSearchCV when I change the size of
>> seq_test. I will look at the transform tomorrow to see if I can work this
>> out. Thank you for your help so far!
>>
>> On Wed, Aug 2, 2017 at 9:20 PM, Andreas Mueller wrote:
>>
>>> Change the size of seq_test in your notebook and you'll see the failure
>>> without GridSearchCV.
>>> I haven't looked at your code in detail, but transform is supposed to
>>> work on arbitrary new data with the same number of features.
>>> Your code requires the test data to have the same shape as the training
>>> data.
>>> Cross-validation will lead to training data and test data having
>>> different sizes. But I feel like something is already wrong if your
>>> test data size depends on your training data size.
>>>
>>>
>>>
>>> On 08/02/2017 03:08 PM, Sam Barnett wrote:
>>>
>>> Hi Andy,
>>>
>>> The purpose of the transformer is to take an ordinary kernel (in this
>>> case I have taken 'rbf' as a default) and return a 'sequentialised' kernel
>>> using a few extra parameters. Hence, the transformer takes an ordinary
>>> data-target pair X, y as its input, and the fit_transform(X, y) method will
>>> output the Gram matrix for X that is associated with this sequentialised
>>> kernel. In the pipeline, this Gram matrix is passed into an SVC classifier
>>> with the kernel parameter set to 'precomputed'.
>>>
>>> Therefore, I do not think your hacky solution would be possible.
>>> However, I am still unsure how to implement your first solution: won't the
>>> Gram matrix from the transformer contain all the necessary kernel values?
>>> Could you elaborate further?
>>>
>>>
>>> Best,
>>> Sam
>>>
>>> On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller wrote:
>>>
>>>> Hi Sam.
>>>> GridSearchCV will do cross-validation, which requires to "transform" >>>> the test data. >>>> The shape of the test-data will be different from the shape of the >>>> training data. >>>> You need to have the ability to compute the kernel between the training >>>> data and new test data. >>>> >>>> A more hacky solution would be to compute the full kernel matrix in >>>> advance and pass that to GridSearchCV. >>>> >>>> You probably don't need it here, but you should also checkout what the >>>> _pairwise attribute does in cross-validation, >>>> because that it likely to come up when playing with kernels. >>>> >>>> Hth, >>>> Andy >>>> >>>> >>>> On 08/02/2017 08:38 AM, Sam Barnett wrote: >>>> >>>> Dear all, >>>> >>>> I have created a 2-step pipeline with a custom transformer followed by >>>> a simple SVC classifier, and I wish to run a grid-search over it. I am able >>>> to successfully create the transformer and the pipeline, and each of these >>>> elements work fine. However, when I try to use the fit() method on my >>>> GridSearchCV object, I get the following error: >>>> >>>> 57 # during fit. >>>> 58 if X.shape != self.input_shape_: >>>> ---> 59 raise ValueError('Shape of input is different from >>>> what was seen ' >>>> 60 'in `fit`') >>>> 61 >>>> >>>> ValueError: Shape of input is different from what was seen in `fit` >>>> >>>> For a full breakdown of the problem, I have written a Jupyter notebook >>>> showing exactly how the error occurs (this also contains all .py files >>>> necessary to run the notebook). Can anybody see how to work through this? >>>> >>>> Many thanks, >>>> Sam Barnett >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: seqsvc.py Type: text/x-python-script Size: 3051 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Sequential Kernel SVC GridSearchCV Test.ipynb Type: application/octet-stream Size: 7678 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SeqKernelLucy.py Type: text/x-python-script Size: 2628 bytes Desc: not available URL: From albertthomas88 at gmail.com Fri Aug 4 08:49:16 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Fri, 04 Aug 2017 12:49:16 +0000 Subject: [scikit-learn] OneClassSvm | Different results on different runs In-Reply-To: References: Message-ID: I opened an issue https://github.com/scikit-learn/scikit-learn/issues/9497 Albert On Thu, Aug 3, 2017 at 6:16 PM Andreas Mueller wrote: > > > On 08/03/2017 09:17 AM, Albert Thomas wrote: > > Yes, in fact, changing the random_state might have an influence on the > > result. The docstring of the random_state parameter for the > > OneClassSVM seems incorrect though... > PR or issue welcome. 
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Fri Aug 4 09:54:00 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 4 Aug 2017 15:54:00 +0200 Subject: [scikit-learn] Extra trees tuning parameters In-Reply-To: References: Message-ID: I believe so even though it's always better to check in the code to see how this parameter is actually used. -- Olivier ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Aug 4 10:50:40 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 4 Aug 2017 10:50:40 -0400 Subject: [scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer In-Reply-To: References: Message-ID: <90aed61d-2d9b-8d02-3c9b-fbe0ccdd98ee@gmail.com> Yes, that's totally fine. The error is unrelated and just means you need to call ``check_is_fitted`` in your predict method to give a nicer error message. On 08/04/2017 06:29 AM, Sam Barnett wrote: > Hi Andy, > I have since been able to resolve the pickling issue, though I am now > getting an error message saying that an error message does not include > the expected string 'fit'. In general, I am trying to use the fit() > method of my classifier to instantiate a separate SVC() classifier > with a custom kernel, fit THAT to the data, then return this instance > as the fitted version of the new classifier. Is this possible in > theory? If so, what is the best way to implement it? > > As before, the requisite code and a .ipynb file is attached. > > Best, > Sam > > On Thu, Aug 3, 2017 at 6:35 PM, Andreas Mueller > wrote: > > Hi Sam. > You need to put these into a reachable namespace (possibly as > private functions) so that they can be pickled. > Please stay on the sklearn mailing list, I might not have time to > reply. > > Andy > > > On 08/03/2017 01:24 PM, Sam Barnett wrote: >> Hi Andy, >> >> I've since tried a different solution: instead of a pipeline, >> I've simply created a classifier that is for the most part like >> svm.SVC, though it takes a few extra inputs for the >> sequentialisation step. I've used a Python function that can >> compute the Gram matrix between two datasets of any shape to pass >> into SVC(), though I'm now having trouble with pickling on the >> check_estimator test. It appears that SeqSVC.fit() doesn't like >> to have methods defined within it. Can you see how to pass this >> test? (the .ipynb file shows the error). >> >> Best, >> Sam >> >> On Wed, Aug 2, 2017 at 9:44 PM, Sam Barnett >> > wrote: >> >> You're right: it does fail without GridSearchCV when I change >> the size of seq_test. I will look at the transform tomorrow >> to see if I can work this out. Thank you for your help so far! >> >> On Wed, Aug 2, 2017 at 9:20 PM, Andreas Mueller >> > wrote: >> >> Change the size of seq_test in your notebook and you'll >> see the failure without GridSearchCV. >> I haven't looked at your code in detail, but transform is >> supposed to work on arbitrary new data with the same >> number of features. >> Your code requires the test data to have the same shape >> as the training data. >> Cross-validation will lead to training data and test data >> having different sizes. But I feel like something is >> already wrong if your >> test data size depends on your training data size. 
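A minimal sketch of the pattern under discussion: a wrapper estimator whose fit() trains an inner SVC with a custom kernel and whose predict() calls check_is_fitted, so that an unfitted instance raises a NotFittedError whose message contains 'fit'. The kernel below is a stand-in, not the sequentialised kernel from the attached code:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted

def linear_gram(X, Y):
    # stand-in kernel; a real "sequentialised" kernel would go here
    return np.dot(X, Y.T)

class WrappedKernelSVC(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0):
        self.C = C

    def fit(self, X, y):
        # fit() instantiates and trains the inner SVC, then returns self.
        self.svc_ = SVC(C=self.C, kernel=linear_gram)
        self.svc_.fit(X, y)
        return self

    def predict(self, X):
        # Raises NotFittedError (message mentions 'fit') if fit was never called.
        check_is_fitted(self, 'svc_')
        return self.svc_.predict(X)

rng = np.random.RandomState(0)
X = rng.randn(40, 3)
y = (X[:, 0] > 0).astype(int)
print(WrappedKernelSVC().fit(X, y).predict(X[:5]))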
>> >> >> >> On 08/02/2017 03:08 PM, Sam Barnett wrote: >>> Hi Andy, >>> >>> The purpose of the transformer is to take an ordinary >>> kernel (in this case I have taken 'rbf' as a default) >>> and return a 'sequentialised' kernel using a few extra >>> parameters. Hence, the transformer takes an ordinary >>> data-target pair X, y as its input, and the >>> fit_transform(X, y) method will output the Gram matrix >>> for X that is associated with this sequentialised >>> kernel. In the pipeline, this Gram matrix is passed into >>> an SVC classifier with the kernel parameter set to >>> 'precomputed'. >>> >>> Therefore, I do not think your hacky solution would be >>> possible. However, I am still unsure how to implement >>> your first solution: won't the Gram matrix from the >>> transformer contain all the necessary kernel values? >>> Could you elaborate further? >>> >>> >>> Best, >>> Sam >>> >>> On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller >>> > wrote: >>> >>> Hi Sam. >>> GridSearchCV will do cross-validation, which >>> requires to "transform" the test data. >>> The shape of the test-data will be different from >>> the shape of the training data. >>> You need to have the ability to compute the kernel >>> between the training data and new test data. >>> >>> A more hacky solution would be to compute the full >>> kernel matrix in advance and pass that to GridSearchCV. >>> >>> You probably don't need it here, but you should also >>> checkout what the _pairwise attribute does in >>> cross-validation, >>> because that it likely to come up when playing with >>> kernels. >>> >>> Hth, >>> Andy >>> >>> >>> On 08/02/2017 08:38 AM, Sam Barnett wrote: >>>> Dear all, >>>> >>>> I have created a 2-step pipeline with a custom >>>> transformer followed by a simple SVC classifier, >>>> and I wish to run a grid-search over it. I am able >>>> to successfully create the transformer and the >>>> pipeline, and each of these elements work fine. >>>> However, when I try to use the fit() method on my >>>> GridSearchCV object, I get the following error: >>>> >>>> 57 # during fit. >>>> 58 if X.shape != self.input_shape_: >>>> ---> 59 raise ValueError('Shape of >>>> input is different from what was seen ' >>>> 60 'in `fit`') >>>> 61 >>>> >>>> ValueError: Shape of input is different from what >>>> was seen in `fit` >>>> >>>> For a full breakdown of the problem, I have written >>>> a Jupyter notebook showing exactly how the error >>>> occurs (this also contains all .py files necessary >>>> to run the notebook). Can anybody see how to work >>>> through this? >>>> >>>> Many thanks, >>>> Sam Barnett >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 
From georg.kf.heiler at gmail.com  Sat Aug 5 05:10:57 2017
From: georg.kf.heiler at gmail.com (Georg Heiler)
Date: Sat, 05 Aug 2017 09:10:57 +0000
Subject: [scikit-learn] transform categorical data to numerical representation
Message-ID: 

Hi,

the LabelEncoder is only meant for a single column, i.e. the target
variable. Is the DictVectorizer or a manual chaining of multiple
LabelEncoders (one per categorical column) the desired way to get values
which can be fed into a subsequent classifier?

Is there some way I have overlooked which works better and possibly also
can handle unseen values by applying most frequent imputation?

regards,
Georg
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From se.raschka at gmail.com  Sat Aug 5 12:13:10 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sat, 5 Aug 2017 12:13:10 -0400
Subject: [scikit-learn] transform categorical data to numerical representation
In-Reply-To: 
References: 
Message-ID: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com>

Hi, Georg,

I bring this up every time here on the mailing list :), and you are probably
aware of this issue, but it makes a difference whether your categorical data
is nominal or ordinal. For instance, if you have an ordinal variable with
values like {small, medium, large} you probably want to encode it as
{1, 2, 3} or {1, 20, 100} or whatever is appropriate based on your domain
knowledge regarding the variable. If you have something like {blue, red,
green} it may make more sense to do a one-hot encoding so that the
classifier doesn't assume a relationship between the values like
blue > red > green or something like that.

Now, the DictVectorizer and OneHotEncoder are both doing one hot encoding.
The LabelEncoder does convert a variable to integer values, but if you have
something like {small, medium, large}, it wouldn't know the order (if that's
an ordinal variable) and it would just assign arbitrary integers in
increasing order. Thus, if you are dealing with ordinal variables, there's
no way around doing this manually; for example you could create mapping
dictionaries for that (most conveniently done in pandas).

Best,
Sebastian

> On Aug 5, 2017, at 5:10 AM, Georg Heiler wrote:
> 
> Hi,
> 
> the LabelEncooder is only meant for a single column i.e. target variable. Is the DictVectorizeer or a manual chaining of multiple LabelEncoders (one per categorical column) the desired way to get values which can be fed into a subsequent classifier?
> 
> Is there some way I have overlooked which works better and possibly also can handle unseen values by applying most frequent imputation?
> 
> regards,
> Georg
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From joel.nothman at gmail.com  Sat Aug 5 18:47:23 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 6 Aug 2017 08:47:23 +1000
Subject: [scikit-learn] transform categorical data to numerical representation
In-Reply-To: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com>
References: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com>
Message-ID: 

We are working on CategoricalEncoder in
https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more
with this kind of thing. Feedback and testing is welcome.
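A short sketch of the manual mapping described above, on toy columns: the ordinal column gets its order from an explicit dictionary, while the nominal column is one-hot encoded so no artificial order is implied:

import pandas as pd

df = pd.DataFrame({'size':  ['small', 'large', 'medium', 'small'],
                   'color': ['blue', 'green', 'red', 'blue']})

# Ordinal: encode the known order explicitly.
size_order = {'small': 1, 'medium': 2, 'large': 3}
df['size_encoded'] = df['size'].map(size_order)

# Nominal: one-hot encode.
df = pd.get_dummies(df, columns=['color'])
print(df)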
On 6 August 2017 at 02:13, Sebastian Raschka wrote: > Hi, Georg, > > I bring this up every time here on the mailing list :), and you probably > aware of this issue, but it makes a difference whether your categorical > data is nominal or ordinal. For instance if you have an ordinal variable > like with values like {small, medium, large} you probably want to encode it > as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based on your > domain knowledge regarding the variable. If you have sth like {blue, red, > green} it may make more sense to do a one-hot encoding so that the > classifier doesn't assume a relationship between the variables like blue > > red > green or sth like that. > > Now, the DictVectorizer and OneHotEncoder are both doing one hot encoding. > The LabelEncoder does convert a variable to integer values, but if you have > sth like {small, medium, large}, it wouldn't know the order (if that's an > ordinal variable) and it would just assign arbitrary integers in increasing > order. Thus, if you are dealing ordinal variables, there's no way around > doing this manually; for example you could create mapping dictionaries for > that (most conveniently done in pandas). > > Best, > Sebastian > > > On Aug 5, 2017, at 5:10 AM, Georg Heiler > wrote: > > > > Hi, > > > > the LabelEncooder is only meant for a single column i.e. target > variable. Is the DictVectorizeer or a manual chaining of multiple > LabelEncoders (one per categorical column) the desired way to get values > which can be fed into a subsequent classifier? > > > > Is there some way I have overlooked which works better and possibly also > can handle unseen values by applying most frequent imputation? > > > > regards, > > Georg > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From georg.kf.heiler at gmail.com Sun Aug 6 06:30:28 2017 From: georg.kf.heiler at gmail.com (Georg Heiler) Date: Sun, 06 Aug 2017 10:30:28 +0000 Subject: [scikit-learn] transform categorical data to numerical representation In-Reply-To: References: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com> Message-ID: @sebastian: thanks. Indeed, I am aware of this problem. I developed something here: https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but realized that the performance of prediction is pretty lame when there are around 100-150 columns used as the input. Do you have some ideas how to speed this up? Regards, Georg Joel Nothman schrieb am So., 6. Aug. 2017 um 00:49 Uhr: > We are working on CategoricalEncoder in > https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more > with this kind of thing. Feedback and testing is welcome. > > On 6 August 2017 at 02:13, Sebastian Raschka wrote: > >> Hi, Georg, >> >> I bring this up every time here on the mailing list :), and you probably >> aware of this issue, but it makes a difference whether your categorical >> data is nominal or ordinal. For instance if you have an ordinal variable >> like with values like {small, medium, large} you probably want to encode it >> as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based on your >> domain knowledge regarding the variable. 
If you have sth like {blue, red, >> green} it may make more sense to do a one-hot encoding so that the >> classifier doesn't assume a relationship between the variables like blue > >> red > green or sth like that. >> >> Now, the DictVectorizer and OneHotEncoder are both doing one hot >> encoding. The LabelEncoder does convert a variable to integer values, but >> if you have sth like {small, medium, large}, it wouldn't know the order (if >> that's an ordinal variable) and it would just assign arbitrary integers in >> increasing order. Thus, if you are dealing ordinal variables, there's no >> way around doing this manually; for example you could create mapping >> dictionaries for that (most conveniently done in pandas). >> >> Best, >> Sebastian >> >> > On Aug 5, 2017, at 5:10 AM, Georg Heiler >> wrote: >> > >> > Hi, >> > >> > the LabelEncooder is only meant for a single column i.e. target >> variable. Is the DictVectorizeer or a manual chaining of multiple >> LabelEncoders (one per categorical column) the desired way to get values >> which can be fed into a subsequent classifier? >> > >> > Is there some way I have overlooked which works better and possibly >> also can handle unseen values by applying most frequent imputation? >> > >> > regards, >> > Georg >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Aug 6 14:37:15 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 6 Aug 2017 14:37:15 -0400 Subject: [scikit-learn] transform categorical data to numerical representation In-Reply-To: References: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com> Message-ID: <103609E5-E50B-4993-87F2-11661E7C7EB5@gmail.com> > performance of prediction is pretty lame when there are around 100-150 columns used as the input. you are talking about computational performance when you are calling the "transform" method? Have you done some profiling to find out where your bottle neck (in the for loop) is? Just one a very quick look, I think this data.loc[~data[column].isin(fittedLabels), column] = str(replacementForUnseen) is already very slow because fittedLabels is an array where you have O(n) lookup instead of an average O(1) by using a hash table. Or is the isin function converting it to a hashtable/set/dict? In general, would it maybe help to use pandas' factorize? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html For predict time, say you have only 1 example for prediction that needs to be converted, you could append prototypes of all possible values that could occur, do the transformation, and then only pass the 1 transformed sample to the classifier. I guess that could be even slow though ... Best, Sebastian > On Aug 6, 2017, at 6:30 AM, Georg Heiler wrote: > > @sebastian: thanks. Indeed, I am aware of this problem. 
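Following the hash-lookup suggestion, a sketch of the replacement step done with a plain dict (constant-time membership) through Series.map rather than isin against an array; the fallback label 'a' stands in for whatever most frequent value was recorded during fit:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
fitted_labels = ['a', 'b', 'c']
col = pd.Series(rng.choice(['a', 'b', 'c', 'd', 'e'], size=100000))

# Seen labels map to their integer code, unseen values become NaN ...
mapping = {label: code for code, label in enumerate(fitted_labels)}
codes = col.map(mapping)

# ... and NaNs are imputed with the code of the most frequent training label.
codes = codes.fillna(mapping['a']).astype(int)
print(codes.value_counts())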
> > I developed something here: https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but realized that the performance of prediction is pretty lame when there are around 100-150 columns used as the input. > Do you have some ideas how to speed this up? > > Regards, > Georg > > Joel Nothman schrieb am So., 6. Aug. 2017 um 00:49 Uhr: > We are working on CategoricalEncoder in https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more with this kind of thing. Feedback and testing is welcome. > > On 6 August 2017 at 02:13, Sebastian Raschka wrote: > Hi, Georg, > > I bring this up every time here on the mailing list :), and you probably aware of this issue, but it makes a difference whether your categorical data is nominal or ordinal. For instance if you have an ordinal variable like with values like {small, medium, large} you probably want to encode it as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based on your domain knowledge regarding the variable. If you have sth like {blue, red, green} it may make more sense to do a one-hot encoding so that the classifier doesn't assume a relationship between the variables like blue > red > green or sth like that. > > Now, the DictVectorizer and OneHotEncoder are both doing one hot encoding. The LabelEncoder does convert a variable to integer values, but if you have sth like {small, medium, large}, it wouldn't know the order (if that's an ordinal variable) and it would just assign arbitrary integers in increasing order. Thus, if you are dealing ordinal variables, there's no way around doing this manually; for example you could create mapping dictionaries for that (most conveniently done in pandas). > > Best, > Sebastian > > > On Aug 5, 2017, at 5:10 AM, Georg Heiler wrote: > > > > Hi, > > > > the LabelEncooder is only meant for a single column i.e. target variable. Is the DictVectorizeer or a manual chaining of multiple LabelEncoders (one per categorical column) the desired way to get values which can be fed into a subsequent classifier? > > > > Is there some way I have overlooked which works better and possibly also can handle unseen values by applying most frequent imputation? > > > > regards, > > Georg > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From georg.kf.heiler at gmail.com Mon Aug 7 02:40:18 2017 From: georg.kf.heiler at gmail.com (Georg Heiler) Date: Mon, 07 Aug 2017 06:40:18 +0000 Subject: [scikit-learn] transform categorical data to numerical representation In-Reply-To: <103609E5-E50B-4993-87F2-11661E7C7EB5@gmail.com> References: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com> <103609E5-E50B-4993-87F2-11661E7C7EB5@gmail.com> Message-ID: I will need to look into factorize. Here is the result from profiling the transform method on a single new observation https://codereview.stackexchange.com/q/171622/132999 Best Georg Sebastian Raschka schrieb am So. 6. Aug. 
2017 um 20:39: > > performance of prediction is pretty lame when there are around 100-150 > columns used as the input. > > you are talking about computational performance when you are calling the > "transform" method? Have you done some profiling to find out where your > bottle neck (in the for loop) is? Just one a very quick look, I think this > > data.loc[~data[column].isin(fittedLabels), column] = > str(replacementForUnseen) > > is already very slow because fittedLabels is an array where you have O(n) > lookup instead of an average O(1) by using a hash table. Or is the isin > function converting it to a hashtable/set/dict? > > In general, would it maybe help to use pandas' factorize? > https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html > For predict time, say you have only 1 example for prediction that needs to > be converted, you could append prototypes of all possible values that could > occur, do the transformation, and then only pass the 1 transformed sample > to the classifier. I guess that could be even slow though ... > > Best, > Sebastian > > > On Aug 6, 2017, at 6:30 AM, Georg Heiler > wrote: > > > > @sebastian: thanks. Indeed, I am aware of this problem. > > > > I developed something here: > https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but > realized that the performance of prediction is pretty lame when there are > around 100-150 columns used as the input. > > Do you have some ideas how to speed this up? > > > > Regards, > > Georg > > > > Joel Nothman schrieb am So., 6. Aug. 2017 um > 00:49 Uhr: > > We are working on CategoricalEncoder in > https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more > with this kind of thing. Feedback and testing is welcome. > > > > On 6 August 2017 at 02:13, Sebastian Raschka > wrote: > > Hi, Georg, > > > > I bring this up every time here on the mailing list :), and you probably > aware of this issue, but it makes a difference whether your categorical > data is nominal or ordinal. For instance if you have an ordinal variable > like with values like {small, medium, large} you probably want to encode it > as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based on your > domain knowledge regarding the variable. If you have sth like {blue, red, > green} it may make more sense to do a one-hot encoding so that the > classifier doesn't assume a relationship between the variables like blue > > red > green or sth like that. > > > > Now, the DictVectorizer and OneHotEncoder are both doing one hot > encoding. The LabelEncoder does convert a variable to integer values, but > if you have sth like {small, medium, large}, it wouldn't know the order (if > that's an ordinal variable) and it would just assign arbitrary integers in > increasing order. Thus, if you are dealing ordinal variables, there's no > way around doing this manually; for example you could create mapping > dictionaries for that (most conveniently done in pandas). > > > > Best, > > Sebastian > > > > > On Aug 5, 2017, at 5:10 AM, Georg Heiler > wrote: > > > > > > Hi, > > > > > > the LabelEncooder is only meant for a single column i.e. target > variable. Is the DictVectorizeer or a manual chaining of multiple > LabelEncoders (one per categorical column) the desired way to get values > which can be fed into a subsequent classifier? > > > > > > Is there some way I have overlooked which works better and possibly > also can handle unseen values by applying most frequent imputation? 
> > > > > > regards, > > > Georg > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From georg.kf.heiler at gmail.com Mon Aug 7 02:41:39 2017 From: georg.kf.heiler at gmail.com (Georg Heiler) Date: Mon, 07 Aug 2017 06:41:39 +0000 Subject: [scikit-learn] transform categorical data to numerical representation In-Reply-To: References: <6D0BF22C-9ABA-4C2A-B35B-210673439286@gmail.com> <103609E5-E50B-4993-87F2-11661E7C7EB5@gmail.com> Message-ID: To my understanding pandas.factorize only works for the static case where no unseen variables can occur. Georg Heiler schrieb am Mo. 7. Aug. 2017 um 08:40: > I will need to look into factorize. Here is the result from profiling the > transform method on a single new observation > https://codereview.stackexchange.com/q/171622/132999 > > > Best Georg > Sebastian Raschka schrieb am So. 6. Aug. 2017 um > 20:39: > >> > performance of prediction is pretty lame when there are around 100-150 >> columns used as the input. >> >> you are talking about computational performance when you are calling the >> "transform" method? Have you done some profiling to find out where your >> bottle neck (in the for loop) is? Just one a very quick look, I think this >> >> data.loc[~data[column].isin(fittedLabels), column] = >> str(replacementForUnseen) >> >> is already very slow because fittedLabels is an array where you have O(n) >> lookup instead of an average O(1) by using a hash table. Or is the isin >> function converting it to a hashtable/set/dict? >> >> In general, would it maybe help to use pandas' factorize? >> https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html >> For predict time, say you have only 1 example for prediction that needs >> to be converted, you could append prototypes of all possible values that >> could occur, do the transformation, and then only pass the 1 transformed >> sample to the classifier. I guess that could be even slow though ... >> >> Best, >> Sebastian >> >> > On Aug 6, 2017, at 6:30 AM, Georg Heiler >> wrote: >> > >> > @sebastian: thanks. Indeed, I am aware of this problem. >> > >> > I developed something here: >> https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but >> realized that the performance of prediction is pretty lame when there are >> around 100-150 columns used as the input. >> > Do you have some ideas how to speed this up? >> > >> > Regards, >> > Georg >> > >> > Joel Nothman schrieb am So., 6. Aug. 2017 um >> 00:49 Uhr: >> > We are working on CategoricalEncoder in >> https://github.com/scikit-learn/scikit-learn/pull/9151 to help users >> more with this kind of thing. Feedback and testing is welcome. 
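For the ordinal case Sebastian describes, a plain mapping dictionary in pandas is usually enough. A small sketch with invented column values:

import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# ordering chosen from domain knowledge; a LabelEncoder cannot infer it
size_map = {'small': 1, 'medium': 2, 'large': 3}
df['size_encoded'] = df['size'].map(size_map)
print(df)

Values missing from the mapping come out as NaN, so unseen ordinal labels are at least easy to detect and impute.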
>> > >> > On 6 August 2017 at 02:13, Sebastian Raschka >> wrote: >> > Hi, Georg, >> > >> > I bring this up every time here on the mailing list :), and you >> probably aware of this issue, but it makes a difference whether your >> categorical data is nominal or ordinal. For instance if you have an ordinal >> variable like with values like {small, medium, large} you probably want to >> encode it as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based on >> your domain knowledge regarding the variable. If you have sth like {blue, >> red, green} it may make more sense to do a one-hot encoding so that the >> classifier doesn't assume a relationship between the variables like blue > >> red > green or sth like that. >> > >> > Now, the DictVectorizer and OneHotEncoder are both doing one hot >> encoding. The LabelEncoder does convert a variable to integer values, but >> if you have sth like {small, medium, large}, it wouldn't know the order (if >> that's an ordinal variable) and it would just assign arbitrary integers in >> increasing order. Thus, if you are dealing ordinal variables, there's no >> way around doing this manually; for example you could create mapping >> dictionaries for that (most conveniently done in pandas). >> > >> > Best, >> > Sebastian >> > >> > > On Aug 5, 2017, at 5:10 AM, Georg Heiler >> wrote: >> > > >> > > Hi, >> > > >> > > the LabelEncooder is only meant for a single column i.e. target >> variable. Is the DictVectorizeer or a manual chaining of multiple >> LabelEncoders (one per categorical column) the desired way to get values >> which can be fed into a subsequent classifier? >> > > >> > > Is there some way I have overlooked which works better and possibly >> also can handle unseen values by applying most frequent imputation? >> > > >> > > regards, >> > > Georg >> > > _______________________________________________ >> > > scikit-learn mailing list >> > > scikit-learn at python.org >> > > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andre.nascimento.melo at gmail.com Thu Aug 10 09:55:22 2017 From: andre.nascimento.melo at gmail.com (=?UTF-8?Q?Andr=C3=A9_Melo?=) Date: Thu, 10 Aug 2017 15:55:22 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices Message-ID: Hello all, I'm trying to use the randomized version of scikit-learn's TruncatedSVD (although I'm actually calling the internal function randomized_svd to get the actual u, s, v matrices). 
While it is working fine for real matrices, for complex matrices I can't get back the original matrix even though the singular values are exactly correct: >>> import numpy as np >>> from sklearn.utils.extmath import randomized_svd >>> N = 3 >>> a = np.random.rand(N, N)*(1 + 1j) >>> u1, s1, v1 = np.linalg.svd(a) >>> u2, s2, v2 = randomized_svd(a, n_components=N, n_iter=7) >>> np.allclose(s1, s2) True >>> np.allclose(a, u1.dot(np.diag(s1)).dot(v1)) True >>> np.allclose(a, u2.dot(np.diag(s2)).dot(v2)) False Any idea what could be wrong? Thank you! Best regards, Andre Melo From olivier.grisel at ensta.org Thu Aug 10 10:13:16 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 10 Aug 2017 16:13:16 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: I have no idea whether the randomized SVD method is supposed to work for complex data or not (from a mathematical point of view). I think that all scikit-learn estimators assume real data (or integer data for class labels) and our input validation utilities will cast numeric values to float64 by default. This might be the cause of your problem. Have a look at the source code to confirm. The reference to the paper can also be found in the docstring of those functions. -- Olivier ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From andre.nascimento.melo at gmail.com Thu Aug 10 10:56:43 2017 From: andre.nascimento.melo at gmail.com (=?UTF-8?Q?Andr=C3=A9_Melo?=) Date: Thu, 10 Aug 2017 16:56:43 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: Hi Olivier, Thank you very much for your reply. I was convinced it couldn't be a fundamental mathematical issue because the singular values were coming out exactly right, so it had to be a problem with the way complex values were being handled. I decided to look at the source code and it turns out the problem is when the following transformation is applied: U = np.dot(Q, Uhat) Replacing this by U = np.dot(Q.conj(), Uhat) solves the issue! Should I report this on github? On 10 August 2017 at 16:13, Olivier Grisel wrote: > I have no idea whether the randomized SVD method is supposed to work for > complex data or not (from a mathematical point of view). I think that all > scikit-learn estimators assume real data (or integer data for class labels) > and our input validation utilities will cast numeric values to float64 by > default. This might be the cause of your problem. Have a look at the source > code to confirm. The reference to the paper can also be found in the > docstring of those functions. > > -- > Olivier > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From andre.nascimento.melo at gmail.com Thu Aug 10 11:08:09 2017 From: andre.nascimento.melo at gmail.com (=?UTF-8?Q?Andr=C3=A9_Melo?=) Date: Thu, 10 Aug 2017 17:08:09 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: Actually, it makes more sense to change B = safe_sparse_dot(Q.T, M) To B = safe_sparse_dot(Q.T.conj(), M) On 10 August 2017 at 16:56, Andr? Melo wrote: > Hi Olivier, > > Thank you very much for your reply. 
I was convinced it couldn't be a > fundamental mathematical issue because the singular values were coming > out exactly right, so it had to be a problem with the way complex > values were being handled. > > I decided to look at the source code and it turns out the problem is > when the following transformation is applied: > > U = np.dot(Q, Uhat) > > Replacing this by > > U = np.dot(Q.conj(), Uhat) > > solves the issue! Should I report this on github? > > On 10 August 2017 at 16:13, Olivier Grisel wrote: >> I have no idea whether the randomized SVD method is supposed to work for >> complex data or not (from a mathematical point of view). I think that all >> scikit-learn estimators assume real data (or integer data for class labels) >> and our input validation utilities will cast numeric values to float64 by >> default. This might be the cause of your problem. Have a look at the source >> code to confirm. The reference to the paper can also be found in the >> docstring of those functions. >> >> -- >> Olivier >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> From joel.nothman at gmail.com Thu Aug 10 23:41:47 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 11 Aug 2017 13:41:47 +1000 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: Should we be more explicitly forbidding complex data in most estimators, and perhaps allow it in a few where it is tested (particularly decomposition)? On 11 August 2017 at 01:08, Andr? Melo wrote: > Actually, it makes more sense to change > > B = safe_sparse_dot(Q.T, M) > > To > B = safe_sparse_dot(Q.T.conj(), M) > > On 10 August 2017 at 16:56, Andr? Melo > wrote: > > Hi Olivier, > > > > Thank you very much for your reply. I was convinced it couldn't be a > > fundamental mathematical issue because the singular values were coming > > out exactly right, so it had to be a problem with the way complex > > values were being handled. > > > > I decided to look at the source code and it turns out the problem is > > when the following transformation is applied: > > > > U = np.dot(Q, Uhat) > > > > Replacing this by > > > > U = np.dot(Q.conj(), Uhat) > > > > solves the issue! Should I report this on github? > > > > On 10 August 2017 at 16:13, Olivier Grisel > wrote: > >> I have no idea whether the randomized SVD method is supposed to work for > >> complex data or not (from a mathematical point of view). I think that > all > >> scikit-learn estimators assume real data (or integer data for class > labels) > >> and our input validation utilities will cast numeric values to float64 > by > >> default. This might be the cause of your problem. Have a look at the > source > >> code to confirm. The reference to the paper can also be found in the > >> docstring of those functions. > >> > >> -- > >> Olivier > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From drraph at gmail.com Fri Aug 11 03:16:59 2017 From: drraph at gmail.com (Raphael C) Date: Fri, 11 Aug 2017 09:16:59 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: Although the first priority should be correctness (in implementation and documentation) and it makes sense to explicitly test for inputs for which code will give the wrong answer, it would be great if we could support complex data types, especially where it is very little extra work. Raphael On 11 August 2017 at 05:41, Joel Nothman wrote: > Should we be more explicitly forbidding complex data in most estimators, and > perhaps allow it in a few where it is tested (particularly decomposition)? > > On 11 August 2017 at 01:08, Andr? Melo > wrote: >> >> Actually, it makes more sense to change >> >> B = safe_sparse_dot(Q.T, M) >> >> To >> B = safe_sparse_dot(Q.T.conj(), M) >> >> On 10 August 2017 at 16:56, Andr? Melo >> wrote: >> > Hi Olivier, >> > >> > Thank you very much for your reply. I was convinced it couldn't be a >> > fundamental mathematical issue because the singular values were coming >> > out exactly right, so it had to be a problem with the way complex >> > values were being handled. >> > >> > I decided to look at the source code and it turns out the problem is >> > when the following transformation is applied: >> > >> > U = np.dot(Q, Uhat) >> > >> > Replacing this by >> > >> > U = np.dot(Q.conj(), Uhat) >> > >> > solves the issue! Should I report this on github? >> > >> > On 10 August 2017 at 16:13, Olivier Grisel >> > wrote: >> >> I have no idea whether the randomized SVD method is supposed to work >> >> for >> >> complex data or not (from a mathematical point of view). I think that >> >> all >> >> scikit-learn estimators assume real data (or integer data for class >> >> labels) >> >> and our input validation utilities will cast numeric values to float64 >> >> by >> >> default. This might be the cause of your problem. Have a look at the >> >> source >> >> code to confirm. The reference to the paper can also be found in the >> >> docstring of those functions. >> >> >> >> -- >> >> Olivier >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From sambarnett95 at gmail.com Fri Aug 11 06:16:50 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Fri, 11 Aug 2017 11:16:50 +0100 Subject: [scikit-learn] Overflow Error with Cross-Validation (but not normally fitting the data) Message-ID: To all, I am working on a scikit-learn estimator that performs a version of SVC with a custom kernel. Unfortunately, I have been presented with a problem: when running a grid search (or even using the cross_val_score function), my estimator encounters an overflow error when evaluating my kernel (specifically, in an array multiplication operation). What is particularly strange about this is that, when I train the estimator on the whole dataset, this error does not occur. In other words: the problem only appears to occur when the data is split into folds. Is this something that has been seen before? 
How ought I fix this? I have attached the source code below (in particular, see the notebook for how the problem arises). Best, Sam -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: kernelsqizer.py Type: text/x-python-script Size: 2592 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SeqSVC Toy Data Tests.ipynb Type: application/octet-stream Size: 6613 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: seqsvc.py Type: text/x-python-script Size: 11023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: timeseriestools.py Type: text/x-python-script Size: 1419 bytes Desc: not available URL: From t3kcit at gmail.com Fri Aug 11 12:37:12 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 11 Aug 2017 12:37:12 -0400 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: I opened https://github.com/scikit-learn/scikit-learn/issues/9528 I suggest to first error everywhere and then fix those for which it seems easy and worth it, as Joel said, probably mostly in decomposition. Though adding support even in a few places seems like dangerous feature creep. On 08/11/2017 03:16 AM, Raphael C wrote: > Although the first priority should be correctness (in implementation > and documentation) and it makes sense to explicitly test for inputs > for which code will give the wrong answer, it would be great if we > could support complex data types, especially where it is very little > extra work. > > Raphael > > On 11 August 2017 at 05:41, Joel Nothman wrote: >> Should we be more explicitly forbidding complex data in most estimators, and >> perhaps allow it in a few where it is tested (particularly decomposition)? >> >> On 11 August 2017 at 01:08, Andr? Melo >> wrote: >>> Actually, it makes more sense to change >>> >>> B = safe_sparse_dot(Q.T, M) >>> >>> To >>> B = safe_sparse_dot(Q.T.conj(), M) >>> >>> On 10 August 2017 at 16:56, Andr? Melo >>> wrote: >>>> Hi Olivier, >>>> >>>> Thank you very much for your reply. I was convinced it couldn't be a >>>> fundamental mathematical issue because the singular values were coming >>>> out exactly right, so it had to be a problem with the way complex >>>> values were being handled. >>>> >>>> I decided to look at the source code and it turns out the problem is >>>> when the following transformation is applied: >>>> >>>> U = np.dot(Q, Uhat) >>>> >>>> Replacing this by >>>> >>>> U = np.dot(Q.conj(), Uhat) >>>> >>>> solves the issue! Should I report this on github? >>>> >>>> On 10 August 2017 at 16:13, Olivier Grisel >>>> wrote: >>>>> I have no idea whether the randomized SVD method is supposed to work >>>>> for >>>>> complex data or not (from a mathematical point of view). I think that >>>>> all >>>>> scikit-learn estimators assume real data (or integer data for class >>>>> labels) >>>>> and our input validation utilities will cast numeric values to float64 >>>>> by >>>>> default. This might be the cause of your problem. Have a look at the >>>>> source >>>>> code to confirm. The reference to the paper can also be found in the >>>>> docstring of those functions. 
>>>>> >>>>> -- >>>>> Olivier >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Fri Aug 11 12:45:31 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 11 Aug 2017 18:45:31 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: References: Message-ID: <20170811164531.GB3756445@phare.normalesup.org> On Fri, Aug 11, 2017 at 12:37:12PM -0400, Andreas Mueller wrote: > I opened https://github.com/scikit-learn/scikit-learn/issues/9528 > I suggest to first error everywhere and then fix those for which it seems > easy and worth it, as Joel said, probably mostly in decomposition. > Though adding support even in a few places seems like dangerous feature > creep. I am trying to predent that I am offline and in vacations, so I shouldn't answer. But I do have a clear cut opinion here. I believe that we should decide _not_ to support complex data everywhere. The reason is that the support for complex data will always be incomplete and risks being buggy. Indeed, complex data is very infrequent in machine learning (unlike with signal processing). Hence, it will recieve little usage. In addition, many machine learning algorithms cannot easily be adapted to complex data. To manage user expectation and to ensure quality of the codebase, let us error on complex data. Should we move this discussion on the issue opened by Andy? Ga?l > On 08/11/2017 03:16 AM, Raphael C wrote: > >Although the first priority should be correctness (in implementation > >and documentation) and it makes sense to explicitly test for inputs > >for which code will give the wrong answer, it would be great if we > >could support complex data types, especially where it is very little > >extra work. > >Raphael > >On 11 August 2017 at 05:41, Joel Nothman wrote: > >>Should we be more explicitly forbidding complex data in most estimators, and > >>perhaps allow it in a few where it is tested (particularly decomposition)? > >>On 11 August 2017 at 01:08, Andr? Melo > >>wrote: > >>>Actually, it makes more sense to change > >>> B = safe_sparse_dot(Q.T, M) > >>>To > >>> B = safe_sparse_dot(Q.T.conj(), M) > >>>On 10 August 2017 at 16:56, Andr? Melo > >>>wrote: > >>>>Hi Olivier, > >>>>Thank you very much for your reply. I was convinced it couldn't be a > >>>>fundamental mathematical issue because the singular values were coming > >>>>out exactly right, so it had to be a problem with the way complex > >>>>values were being handled. > >>>>I decided to look at the source code and it turns out the problem is > >>>>when the following transformation is applied: > >>>>U = np.dot(Q, Uhat) > >>>>Replacing this by > >>>>U = np.dot(Q.conj(), Uhat) > >>>>solves the issue! Should I report this on github? 
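As a standalone NumPy illustration of the fix under discussion (not the scikit-learn code itself): for complex input, the projection step has to use the conjugate (Hermitian) transpose of Q, otherwise the original matrix cannot be reconstructed.

import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(4, 4) + 1j * rng.rand(4, 4)

# an orthonormal basis Q for the range of A, as used in randomized SVD
Q, _ = np.linalg.qr(A)

# projecting with the plain transpose loses information for complex data ...
B_wrong = np.dot(Q.T, A)
# ... while the Hermitian transpose satisfies A == Q (Q^H A)
B_right = np.dot(Q.conj().T, A)

print(np.allclose(A, np.dot(Q, B_wrong)))   # False
print(np.allclose(A, np.dot(Q, B_right)))   # True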
> >>>>On 10 August 2017 at 16:13, Olivier Grisel > >>>>wrote: > >>>>>I have no idea whether the randomized SVD method is supposed to work > >>>>>for > >>>>>complex data or not (from a mathematical point of view). I think that > >>>>>all > >>>>>scikit-learn estimators assume real data (or integer data for class > >>>>>labels) > >>>>>and our input validation utilities will cast numeric values to float64 > >>>>>by > >>>>>default. This might be the cause of your problem. Have a look at the > >>>>>source > >>>>>code to confirm. The reference to the paper can also be found in the > >>>>>docstring of those functions. > >>>>>-- > >>>>>Olivier > >>>>>_______________________________________________ > >>>>>scikit-learn mailing list > >>>>>scikit-learn at python.org > >>>>>https://mail.python.org/mailman/listinfo/scikit-learn > >>>_______________________________________________ > >>>scikit-learn mailing list > >>>scikit-learn at python.org > >>>https://mail.python.org/mailman/listinfo/scikit-learn > >>_______________________________________________ > >>scikit-learn mailing list > >>scikit-learn at python.org > >>https://mail.python.org/mailman/listinfo/scikit-learn > >_______________________________________________ > >scikit-learn mailing list > >scikit-learn at python.org > >https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From olivier.grisel at ensta.org Fri Aug 11 17:49:13 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 11 Aug 2017 23:49:13 +0200 Subject: [scikit-learn] scikit-learn 0.19.0 is out! Message-ID: Grab it with pip or conda ! Quoting the release highlights from the website: We are excited to release a number of great new features including neighbors.LocalOutlierFactor for anomaly detection, preprocessing.QuantileTransformer for robust feature transformation, and the multioutput.ClassifierChain meta-estimator to simply account for dependencies between classes in multilabel problems. We have some new algorithms in existing estimators, such as multiplicative update in decomposition.NMF and multinomial linear_model.LogisticRegression with L1 loss (use solver='saga'). Cross validation is now able to return the results from multiple metric evaluations. The new model_selection.cross_validate can return many scores on the test data as well as training set performance and timings, and we have extended the scoring and refit parameters for grid/randomized search to handle multiple metrics. You can also learn faster. For instance, the new option to cache transformations in pipeline.Pipeline makes grid search over pipelines including slow transformations much more efficient. And you can predict faster: if you?re sure you know what you?re doing, you can turn off validating that the input is finite using config_context. We?ve made some important fixes too. We?ve fixed a longstanding implementation error in metrics.average_precision_score, so please be cautious with prior results reported from that function. A number of errors in the manifold.TSNE implementation have been fixed, particularly in the default Barnes-Hut approximation. semi_supervised.LabelSpreading and semi_supervised.LabelPropagation have had substantial fixes. 
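As a short illustration of the multiple-metric cross-validation mentioned above, here is a sketch on a toy dataset; the estimator and metrics chosen here are arbitrary:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# several metrics at once; training scores and timings are returned as well
results = cross_validate(LogisticRegression(), X, y, cv=5,
                         scoring=('accuracy', 'f1_macro'),
                         return_train_score=True)
print(sorted(results.keys()))
# ['fit_time', 'score_time', 'test_accuracy', 'test_f1_macro',
#  'train_accuracy', 'train_f1_macro']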
LabelPropagation was previously broken. LabelSpreading should now correctly respect its alpha parameter. Please see the full changelog at: http://scikit-learn.org/0.19/whats_new.html#version-0-19 Notably some models have changed behaviors (bug fixes) and some methods or parameters part of the public API have been deprecated. A big thank you to anyone who made this release possible and Joel in particular. -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Aug 11 17:57:03 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 11 Aug 2017 17:57:03 -0400 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: References: Message-ID: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> Thank you everybody for making the release possible, in particular Olivier and Joel :) Wohoo! From fabian.sippl at gmx.net Fri Aug 11 17:57:32 2017 From: fabian.sippl at gmx.net (fabian.sippl at gmx.net) Date: Fri, 11 Aug 2017 23:57:32 +0200 Subject: [scikit-learn] Question-Early Stopping MLPClassifer RandomizedSearchCV Message-ID: An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Fri Aug 11 18:16:07 2017 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sat, 12 Aug 2017 00:16:07 +0200 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> References: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> Message-ID: Congrats guys!!!! On 11 August 2017 at 23:57, Andreas Mueller wrote: > Thank you everybody for making the release possible, in particular Olivier > and Joel :) > > Wohoo! > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Aug 12 01:14:49 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 12 Aug 2017 07:14:49 +0200 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: References: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> Message-ID: <20170812051449.GD3225585@phare.normalesup.org> Hurray, thank you everybody. This is a good one! (as always). Ga?l On Sat, Aug 12, 2017 at 12:16:07AM +0200, Guillaume Lema?tre wrote: > Congrats guys!!!! > On 11 August 2017 at 23:57, Andreas Mueller wrote: > Thank you everybody for making the release possible, in particular Olivier > and Joel :) > Wohoo! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From se.raschka at gmail.com Sat Aug 12 01:19:41 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sat, 12 Aug 2017 01:19:41 -0400 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: <20170812051449.GD3225585@phare.normalesup.org> References: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> <20170812051449.GD3225585@phare.normalesup.org> Message-ID: Yay, as an avid user, thanks to all the developers! 
This is a great release indeed -- no breaking changes (at least for my code base) and so many improvements and additions (that I need to check out in detail) :) > On Aug 12, 2017, at 1:14 AM, Gael Varoquaux wrote: > > Hurray, thank you everybody. This is a good one! (as always). > > Ga?l > > On Sat, Aug 12, 2017 at 12:16:07AM +0200, Guillaume Lema?tre wrote: >> Congrats guys!!!! > >> On 11 August 2017 at 23:57, Andreas Mueller wrote: > >> Thank you everybody for making the release possible, in particular Olivier >> and Joel :) > >> Wohoo! > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Sat Aug 12 01:32:38 2017 From: raga.markely at gmail.com (Raga Markely) Date: Sat, 12 Aug 2017 01:32:38 -0400 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: References: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> <20170812051449.GD3225585@phare.normalesup.org> Message-ID: Thanks a lot for all the hard work and congratz! Best, Raga On Aug 12, 2017 1:21 AM, "Sebastian Raschka" wrote: > Yay, as an avid user, thanks to all the developers! This is a great > release indeed -- no breaking changes (at least for my code base) and so > many improvements and additions (that I need to check out in detail) :) > > > > On Aug 12, 2017, at 1:14 AM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > > > > Hurray, thank you everybody. This is a good one! (as always). > > > > Ga?l > > > > On Sat, Aug 12, 2017 at 12:16:07AM +0200, Guillaume Lema?tre wrote: > >> Congrats guys!!!! > > > >> On 11 August 2017 at 23:57, Andreas Mueller wrote: > > > >> Thank you everybody for making the release possible, in particular > Olivier > >> and Joel :) > > > >> Wohoo! > > > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > -- > > Gael Varoquaux > > Researcher, INRIA Parietal > > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > > Phone: ++ 33-1-69-08-79-68 > > http://gael-varoquaux.info http://twitter.com/ > GaelVaroquaux > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ashimb9 at gmail.com Sat Aug 12 02:32:57 2017 From: ashimb9 at gmail.com (Ashim Bhattarai) Date: Sat, 12 Aug 2017 01:32:57 -0500 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: References: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> <20170812051449.GD3225585@phare.normalesup.org> Message-ID: Yes, thank you everyone! On Sat, Aug 12, 2017 at 12:32 AM, Raga Markely wrote: > Thanks a lot for all the hard work and congratz! 
> > Best, > Raga > > On Aug 12, 2017 1:21 AM, "Sebastian Raschka" wrote: > >> Yay, as an avid user, thanks to all the developers! This is a great >> release indeed -- no breaking changes (at least for my code base) and so >> many improvements and additions (that I need to check out in detail) :) >> >> >> > On Aug 12, 2017, at 1:14 AM, Gael Varoquaux < >> gael.varoquaux at normalesup.org> wrote: >> > >> > Hurray, thank you everybody. This is a good one! (as always). >> > >> > Ga?l >> > >> > On Sat, Aug 12, 2017 at 12:16:07AM +0200, Guillaume Lema?tre wrote: >> >> Congrats guys!!!! >> > >> >> On 11 August 2017 at 23:57, Andreas Mueller wrote: >> > >> >> Thank you everybody for making the release possible, in particular >> Olivier >> >> and Joel :) >> > >> >> Wohoo! >> > >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -- >> > Gael Varoquaux >> > Researcher, INRIA Parietal >> > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> > Phone: ++ 33-1-69-08-79-68 >> > http://gael-varoquaux.info http://twitter.com/GaelVaroqua >> ux >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bertrand.thirion at inria.fr Sat Aug 12 04:50:43 2017 From: bertrand.thirion at inria.fr (bthirion) Date: Sat, 12 Aug 2017 10:50:43 +0200 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: References: Message-ID: <539f18e9-8e59-7294-bd3e-26ae24928f3f@inria.fr> Congratulations for all these improvements and for orchestrating the release ! Bertrand On 11/08/2017 23:49, Olivier Grisel wrote: > Grab it with pip or conda ! > > Quoting the release highlights from the website: > > We are excited to release a number of great new features including > neighbors.LocalOutlierFactor for anomaly detection, > preprocessing.QuantileTransformer for robust feature transformation, > and the multioutput.ClassifierChain meta-estimator to simply account > for dependencies between classes in multilabel problems. We have some > new algorithms in existing estimators, such as multiplicative update > in decomposition.NMF and multinomial linear_model.LogisticRegression > with L1 loss (use solver='saga'). > > Cross validation is now able to return the results from multiple > metric evaluations. The new model_selection.cross_validate can return > many scores on the test data as well as training set performance and > timings, and we have extended the scoring and refit parameters for > grid/randomized search to handle multiple metrics. > > You can also learn faster. For instance, the new option to cache > transformations in pipeline.Pipeline makes grid search over pipelines > including slow transformations much more efficient. And you can > predict faster: if you?re sure you know what you?re doing, you can > turn off validating that the input is finite using config_context. > > We?ve made some important fixes too. 
We?ve fixed a longstanding > implementation error in metrics.average_precision_score, so please be > cautious with prior results reported from that function. A number of > errors in the manifold.TSNE implementation have been fixed, > particularly in the default Barnes-Hut approximation. > semi_supervised.LabelSpreading and semi_supervised.LabelPropagation > have had substantial fixes. LabelPropagation was previously broken. > LabelSpreading should now correctly respect its alpha parameter. > > Please see the full changelog at: > > http://scikit-learn.org/0.19/whats_new.html#version-0-19 > > Notably some models have changed behaviors (bug fixes) and some > methods or parameters part of the public API have been deprecated. > > A big thank you to anyone who made this release possible and Joel in > particular. > > -- > Olivier > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From valerio.maggio at gmail.com Sat Aug 12 05:18:08 2017 From: valerio.maggio at gmail.com (Valerio Maggio) Date: Sat, 12 Aug 2017 09:18:08 +0000 Subject: [scikit-learn] scikit-learn 0.19.0 is out! In-Reply-To: References: <77c26ae0-808a-3d87-1912-c634edc1fb7c@gmail.com> <20170812051449.GD3225585@phare.normalesup.org> Message-ID: On Sat, 12 Aug 2017 at 07:20, Sebastian Raschka wrote: > Yay, as an avid user, thanks to all the developers! This is a great > release indeed -- no breaking changes (at least for my code base) and so > many improvements and additions (that I need to check out in detail) :) Quoting Sebastian: totally agree!! +1 Thanks a lot for this super new release and for all these improvements. Cheers Valerio > > > > On Aug 12, 2017, at 1:14 AM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > > > > Hurray, thank you everybody. This is a good one! (as always). > > > > Ga?l > > > > On Sat, Aug 12, 2017 at 12:16:07AM +0200, Guillaume Lema?tre wrote: > >> Congrats guys!!!! > > > >> On 11 August 2017 at 23:57, Andreas Mueller wrote: > > > >> Thank you everybody for making the release possible, in particular > Olivier > >> and Joel :) > > > >> Wohoo! > > > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > -- > > Gael Varoquaux > > Researcher, INRIA Parietal > > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > > Phone: ++ 33-1-69-08-79-68 > > http://gael-varoquaux.info > http://twitter.com/GaelVaroquaux > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- # valerio -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at telecom-paristech.fr Sat Aug 12 08:11:13 2017 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Sat, 12 Aug 2017 14:11:13 +0200 Subject: [scikit-learn] Truncated svd not working for complex matrices In-Reply-To: <20170811164531.GB3756445@phare.normalesup.org> References: <20170811164531.GB3756445@phare.normalesup.org> Message-ID: I agree with Ga?l on this. 
If you want to support complex values just copy the estimators / functions you want and maintain them in a separate package. +1 to error when complex are passed. From b.noushin7 at gmail.com Sun Aug 13 12:16:53 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Sun, 13 Aug 2017 12:16:53 -0400 Subject: [scikit-learn] No module named crluster.hierarchical Message-ID: Dear all, I am writing this import: from sklearn.crluster.hierarchical import (_hc_cut, _TREE_BUILDERS, linkage_tree) But it gives this error: ImportError: No module named crluster.hierarchical Any clue? Best regards, -Noushin -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Sun Aug 13 12:20:27 2017 From: zephyr14 at gmail.com (Vlad Niculae) Date: Sun, 13 Aug 2017 12:20:27 -0400 Subject: [scikit-learn] No module named crluster.hierarchical In-Reply-To: References: Message-ID: Looks like you're misspelling the word "cluster". Yours, Vlad On Aug 13, 2017 12:19 PM, "Ariani A" wrote: > Dear all, > > I am writing this import: > > from sklearn.crluster.hierarchical import (_hc_cut, _TREE_BUILDERS, > linkage_tree) > But it gives this error: > ImportError: No module named crluster.hierarchical > > Any clue? > Best regards, > -Noushin > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Sun Aug 13 12:25:49 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Sun, 13 Aug 2017 12:25:49 -0400 Subject: [scikit-learn] No module named crluster.hierarchical In-Reply-To: References: Message-ID: Thank you so much! On Sun, Aug 13, 2017 at 12:20 PM, Vlad Niculae wrote: > Looks like you're misspelling the word "cluster". > > Yours, > Vlad > > On Aug 13, 2017 12:19 PM, "Ariani A" wrote: > >> Dear all, >> >> I am writing this import: >> >> from sklearn.crluster.hierarchical import (_hc_cut, _TREE_BUILDERS, >> linkage_tree) >> But it gives this error: >> ImportError: No module named crluster.hierarchical >> >> Any clue? >> Best regards, >> -Noushin >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 14 10:05:56 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 14 Aug 2017 10:05:56 -0400 Subject: [scikit-learn] Question-Early Stopping MLPClassifer RandomizedSearchCV In-Reply-To: References: Message-ID: Yes, you understood correctly. You can see the implementation in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L491 It calls ``train_test_split``, so it's a random subset of the data. Currently the API doesn't allow providing your own validation set. What is the use-case for that? Andy On 08/11/2017 05:57 PM, fabian.sippl at gmx.net wrote: > Hello Scikit-Learn Team, > I?ve got a question concerning the implementation of Early Stopping in > MLPClassifier. I am using it in combination with RandomizedSearchCV. 
> The fraction used for validation in early stopping is set with the > parameter validation_fraction of MLPClassifier. How is the validaton > set extracted from the training set ? Does the function simply take > the last X % from the training set ? Is there a possibility to > manually set this validation set ? > I wonder whether I correctly understand the functionality: The neural > net is trained on the training data and the performance is evaluated > after every epoch on the validation set (which is internally selected > by the MLPClassifer)? If the Net stops training, the performance on > the left out data (Parameter "cv" in RandomizedSearch) is determined ? > Thank you very much for your help ! > Kind Regards, > Fabian Sippl > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From georg.kf.heiler at gmail.com Wed Aug 16 07:28:21 2017 From: georg.kf.heiler at gmail.com (Georg Heiler) Date: Wed, 16 Aug 2017 11:28:21 +0000 Subject: [scikit-learn] caching transformers during hyper parameter optimization Message-ID: There is a new option in the pipeline: http://scikit-learn.org/stable/modules/pipeline.html#pipeline-cache How can I use this to also store the transformed data as I only want to compute the last step i.e. estimator during hyper parameter tuning and not the transform methods of the clean steps? Is there a possibility to apply this for crossvalidation? I would want to see all the folds precomputed and stored to disk in a folder. Regards, Georg -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Aug 16 07:51:19 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 16 Aug 2017 21:51:19 +1000 Subject: [scikit-learn] caching transformers during hyper parameter optimization In-Reply-To: References: Message-ID: We certainly considered this over the many years that Pipeline caching has been in the pipeline. Storing the fitted model means we can do both a fit_transform and a transform on new data, and in many cases takes away the pain point of CV over pipelines where downstream steps are varied. What transformer are you using where the transform is costly? Or is it more a matter of you wanting to store the transformed data at each step? There are custom ways to do this sort of thing generically with a mixin if you really want. On 16 August 2017 at 21:28, Georg Heiler wrote: > There is a new option in the pipeline: http://scikit-learn. > org/stable/modules/pipeline.html#pipeline-cache > How can I use this to also store the transformed data as I only want to > compute the last step i.e. estimator during hyper parameter tuning and not > the transform methods of the clean steps? > > Is there a possibility to apply this for crossvalidation? I would want to > see all the folds precomputed and stored to disk in a folder. > > Regards, > Georg > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From georg.kf.heiler at gmail.com Wed Aug 16 12:53:28 2017 From: georg.kf.heiler at gmail.com (Georg Heiler) Date: Wed, 16 Aug 2017 16:53:28 +0000 Subject: [scikit-learn] caching transformers during hyper parameter optimization In-Reply-To: References: Message-ID: Data cleaning @ enrichment Could you link an example for a mixing? Currently this is a bit if a mess with custom pickle persistence in a big for loop and custom transformers Thanks. Georg Joel Nothman schrieb am Mi. 16. Aug. 2017 um 13:51: > We certainly considered this over the many years that Pipeline caching has > been in the pipeline. Storing the fitted model means we can do both a > fit_transform and a transform on new data, and in many cases takes away the > pain point of CV over pipelines where downstream steps are varied. > > What transformer are you using where the transform is costly? Or is it > more a matter of you wanting to store the transformed data at each step? > > There are custom ways to do this sort of thing generically with a mixin if > you really want. > > On 16 August 2017 at 21:28, Georg Heiler > wrote: > >> There is a new option in the pipeline: >> http://scikit-learn.org/stable/modules/pipeline.html#pipeline-cache >> How can I use this to also store the transformed data as I only want to >> compute the last step i.e. estimator during hyper parameter tuning and not >> the transform methods of the clean steps? >> >> Is there a possibility to apply this for crossvalidation? I would want to >> see all the folds precomputed and stored to disk in a folder. >> >> Regards, >> Georg >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Aug 16 21:15:03 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 17 Aug 2017 11:15:03 +1000 Subject: [scikit-learn] caching transformers during hyper parameter optimization In-Reply-To: References: Message-ID: Now this isn't the best example, because joblib.Memory isn't going to be very fast at dumping a list of strings, but I hope you can get the idea from https://gist.github.com/jnothman/019d594d197c98a3d6192fa0cb19c850 On 17 August 2017 at 02:53, Georg Heiler wrote: > Data cleaning @ enrichment > > Could you link an example for a mixing? > > Currently this is a bit if a mess with custom pickle persistence in a big > for loop and custom transformers > > Thanks. > Georg > Joel Nothman schrieb am Mi. 16. Aug. 2017 um > 13:51: > >> We certainly considered this over the many years that Pipeline caching >> has been in the pipeline. Storing the fitted model means we can do both a >> fit_transform and a transform on new data, and in many cases takes away the >> pain point of CV over pipelines where downstream steps are varied. >> >> What transformer are you using where the transform is costly? Or is it >> more a matter of you wanting to store the transformed data at each step? >> >> There are custom ways to do this sort of thing generically with a mixin >> if you really want. >> >> On 16 August 2017 at 21:28, Georg Heiler >> wrote: >> >>> There is a new option in the pipeline: http://scikit-learn. 
>>> org/stable/modules/pipeline.html#pipeline-cache >>> How can I use this to also store the transformed data as I only want to >>> compute the last step i.e. estimator during hyper parameter tuning and not >>> the transform methods of the clean steps? >>> >>> Is there a possibility to apply this for crossvalidation? I would want >>> to see all the folds precomputed and stored to disk in a folder. >>> >>> Regards, >>> Georg >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sambarnett95 at gmail.com Thu Aug 17 05:22:21 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Thu, 17 Aug 2017 10:22:21 +0100 Subject: [scikit-learn] Malformed input for SVC(kernel='precomputed').predict() Message-ID: I am rolling classifier based on SVC which computes a custom Gram matrix and runs this through the SVC classifier with kernel = 'precomputed'. While this works fine with the fit method, I face a dilemma with the predict method, shown here: def predict(self, X): """Run the predict method of the previously-instantiated SVM classifier, returning the predicted classes for test set X.""" # Check is fit had been called check_is_fitted(self, ['X_', 'y_']) # Input validation X = check_array(X) cut_off = self.cut_ord_pair[0] order = self.cut_ord_pair[1] X_gram = seq_kernel_free(X, self.X_, \ pri_kernel=kernselect(self.kernel, self.coef0, self.gamma, self.degree, self.scale), \ cut_off=cut_off, order=order) X_gram = np.nan_to_num(X_gram) return self.ord_svc_.predict(X_gram) This will run on any dataset just fine. However, it fails the check_estimator test. Specifically, when trying to raise an error for malformed input on predict (in check_classifiers_train), it says that a ValueError is not raised. Yet if I change the order of X and self.X_ in seq_kernel_free (which computes the [n_samples_train, n_samples_test] Gram matrix), it passes the check_estimator test yet fails to run the predict method. How do I resolve both issues simultaneously? -------------- next part -------------- An HTML attachment was scrubbed... URL: From georg.kf.heiler at gmail.com Thu Aug 17 07:50:33 2017 From: georg.kf.heiler at gmail.com (Georg Heiler) Date: Thu, 17 Aug 2017 11:50:33 +0000 Subject: [scikit-learn] Categorical handling Message-ID: Hi, how can I properly handle categorical values in scikit-learn? https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934 goals - scikit-learn syle fit/transform methods to encode labels of categorical features of X - should handle unseen labels - should be faster than running a label encoder manually for each fold and manually checking if the label already was seen in the training data i.e. 
what I currently do ( https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934 which links to https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce ) - only some columns are categorical, and only these should be converted Regards, Georg -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Aug 17 11:03:16 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 17 Aug 2017 11:03:16 -0400 Subject: [scikit-learn] Malformed input for SVC(kernel='precomputed').predict() In-Reply-To: References: Message-ID: <0d66111e-3e35-869b-9753-1f20cb118216@gmail.com> Hi Sam. Can you say which test fails exactly and where (i.e. give traceback)? The estimator checks are currently quite strict with respect to raising helpful error messages. That doesn't mean your estimator is broken (necessarily). With a precomputed gram matrix, I expect the shape of X in predict to be (n_samples_test, n_samples_train), right? Does you estimator have a _pairwise attribute? (It should to work with cross-validation, I'm not sure if it's used in the estimator checks right now, but it should). Your feedback will help making check_estimator be more robust. I don't think it's tested with anything that requires "precomputed" kernels. Thanks Andy On 08/17/2017 05:22 AM, Sam Barnett wrote: > I am rolling classifier based on SVC which computes a custom Gram > matrix and runs this through the SVC classifier with kernel = > 'precomputed'. While this works fine with the fit method, I face a > dilemma with the predict method, shown here: > > > def predict(self, X): > """Run the predict method of the previously-instantiated SVM > classifier, returning the predicted classes for test set X.""" > > # Check is fit had been called > check_is_fitted(self, ['X_', 'y_']) > > # Input validation > X = check_array(X) > > cut_off = self.cut_ord_pair[0] > order = self.cut_ord_pair[1] > > X_gram = seq_kernel_free(X, self.X_, \ > pri_kernel=kernselect(self.kernel, self.coef0, self.gamma, > self.degree, self.scale), \ > cut_off=cut_off, order=order) > > X_gram = np.nan_to_num(X_gram) > > return self.ord_svc_.predict(X_gram) > > This will run on any dataset just fine. However, it fails the > check_estimator test. Specifically, when trying to raise an error for > malformed input on predict (in check_classifiers_train), it says that > a ValueError is not raised. Yet if I change the order of X and self.X_ > in seq_kernel_free (which computes the [n_samples_train, > n_samples_test] Gram matrix), it passes the check_estimator test yet > fails to run the predict method. > > How do I resolve both issues simultaneously? > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Aug 17 11:11:49 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 17 Aug 2017 11:11:49 -0400 Subject: [scikit-learn] Categorical handling In-Reply-To: References: Message-ID: Hi Georg. Unfortunately this is not entirely trivial right now, but will be fixed by https://github.com/scikit-learn/scikit-learn/pull/9151 and https://github.com/scikit-learn/scikit-learn/pull/9012 which will be in the next release (0.20). 
LabelBinarizer is probably the best work-around for now, and selecting columns can be done (awkwardly) like in this example: http://scikit-learn.org/dev/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py Best, Andy On 08/17/2017 07:50 AM, Georg Heiler wrote: > Hi, > > how can I properly handle categorical values in scikit-learn? > https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934 > > > goals > > * scikit-learn syle fit/transform methods to encode labels of > categorical features of X > * should handle unseen labels > * should be faster than running a label encoder manually for each > fold and manually checking if the label already was seen in the > training data i.e. what I currently do > (https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934 which > links to > https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce) > * only some columns are categorical, and only these should be converted > > > Regards, > Georg > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Aug 17 11:26:13 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 18 Aug 2017 01:26:13 +1000 Subject: [scikit-learn] Categorical handling In-Reply-To: References: Message-ID: I don't consider LabelBinarizer the best workaround. Given a Pandas dataframe df, I'd use: DictVectorizer().fit_transform(df.to_dict(orient='records')) which will handle encoding strings with one-hot and numerical features as column vectors. Or: class PandasVectorizer(DictVectorizer): def fit(self, x, y=None): return super(PandasVectorizer, self).fit(x.to_dict('records')) def fit_transform(self, x, y=None): return super(PandasVectorizer, self).fit_transform(x.to_dict('records')) def transform(self, x): return super(PandasVectorizer, self).transform(x.to_dict('records')) On 18 August 2017 at 01:11, Andreas Mueller wrote: > Hi Georg. > Unfortunately this is not entirely trivial right now, but will be fixed by > https://github.com/scikit-learn/scikit-learn/pull/9151 > and > https://github.com/scikit-learn/scikit-learn/pull/9012 > which will be in the next release (0.20). > > LabelBinarizer is probably the best work-around for now, and selecting > columns can be done (awkwardly) > like in this example: http://scikit-learn.org/dev/ > auto_examples/hetero_feature_union.html#sphx-glr-auto- > examples-hetero-feature-union-py > > Best, > Andy > > > On 08/17/2017 07:50 AM, Georg Heiler wrote: > > Hi, > > how can I properly handle categorical values in scikit-learn? > https://stackoverflow.com/questions/45727934/pandas-categories-new-levels? > noredirect=1#comment78424496_45727934 > > goals > > - scikit-learn syle fit/transform methods to encode labels of > categorical features of X > - should handle unseen labels > - should be faster than running a label encoder manually for each fold > and manually checking if the label already was seen in the training data > i.e. what I currently do (https://stackoverflow.com/ > questions/45727934/pandas-categories-new-levels? 
> noredirect=1#comment78424496_45727934 > which > links to https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2 > ce) > - only some columns are categorical, and only these should be converted > > > Regards, > Georg > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Aug 17 11:27:43 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 18 Aug 2017 01:27:43 +1000 Subject: [scikit-learn] Categorical handling In-Reply-To: References: Message-ID: gist at https://gist.github.com/jnothman/a75bac778c1eb9661017555249e50379 On 18 August 2017 at 01:26, Joel Nothman wrote: > I don't consider LabelBinarizer the best workaround. > > Given a Pandas dataframe df, I'd use: > > DictVectorizer().fit_transform(df.to_dict(orient='records')) > > which will handle encoding strings with one-hot and numerical features as > column vectors. Or: > > class PandasVectorizer(DictVectorizer): > def fit(self, x, y=None): > return super(PandasVectorizer, self).fit(x.to_dict('records')) > def fit_transform(self, x, y=None): > return super(PandasVectorizer, self).fit_transform(x.to_dict( > 'records')) > def transform(self, x): > return super(PandasVectorizer, self).transform(x.to_dict(' > records')) > > > On 18 August 2017 at 01:11, Andreas Mueller wrote: > >> Hi Georg. >> Unfortunately this is not entirely trivial right now, but will be fixed by >> https://github.com/scikit-learn/scikit-learn/pull/9151 >> and >> https://github.com/scikit-learn/scikit-learn/pull/9012 >> which will be in the next release (0.20). >> >> LabelBinarizer is probably the best work-around for now, and selecting >> columns can be done (awkwardly) >> like in this example: http://scikit-learn.org/dev/au >> to_examples/hetero_feature_union.html#sphx-glr-auto-examples >> -hetero-feature-union-py >> >> Best, >> Andy >> >> >> On 08/17/2017 07:50 AM, Georg Heiler wrote: >> >> Hi, >> >> how can I properly handle categorical values in scikit-learn? >> https://stackoverflow.com/questions/45727934/pandas-categori >> es-new-levels?noredirect=1#comment78424496_45727934 >> >> goals >> >> - scikit-learn syle fit/transform methods to encode labels of >> categorical features of X >> - should handle unseen labels >> - should be faster than running a label encoder manually for each >> fold and manually checking if the label already was seen in the training >> data i.e. what I currently do (https://stackoverflow.com/que >> stions/45727934/pandas-categories-new-levels?noredirect=1# >> comment78424496_45727934 >> which >> links to https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b07 >> 99dc2ce) >> - only some columns are categorical, and only these should be >> converted >> >> >> Regards, >> Georg >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sambarnett95 at gmail.com Thu Aug 17 13:21:24 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Thu, 17 Aug 2017 18:21:24 +0100 Subject: [scikit-learn] Malformed input for SVC(kernel='precomputed').predict() In-Reply-To: <0d66111e-3e35-869b-9753-1f20cb118216@gmail.com> References: <0d66111e-3e35-869b-9753-1f20cb118216@gmail.com> Message-ID: Hi Andy, Please find attached a Jupyter notebook showing exactly where the problem appears. Best, Sam On Thu, Aug 17, 2017 at 4:03 PM, Andreas Mueller wrote: > Hi Sam. > > Can you say which test fails exactly and where (i.e. give traceback)? > The estimator checks are currently quite strict with respect to raising > helpful error messages. > That doesn't mean your estimator is broken (necessarily). > With a precomputed gram matrix, I expect the shape of X in predict to be > (n_samples_test, n_samples_train), right? > Does you estimator have a _pairwise attribute? (It should to work with > cross-validation, I'm not sure if it's > used in the estimator checks right now, but it should). > > Your feedback will help making check_estimator be more robust. I don't > think it's tested with anything that requires > "precomputed" kernels. > > Thanks > > Andy > > > On 08/17/2017 05:22 AM, Sam Barnett wrote: > > I am rolling classifier based on SVC which computes a custom Gram matrix > and runs this through the SVC classifier with kernel = 'precomputed'. While > this works fine with the fit method, I face a dilemma with the predict > method, shown here: > > > def predict(self, X): > """Run the predict method of the previously-instantiated SVM > classifier, returning the predicted classes for test set X.""" > > # Check is fit had been called > check_is_fitted(self, ['X_', 'y_']) > > # Input validation > X = check_array(X) > > cut_off = self.cut_ord_pair[0] > order = self.cut_ord_pair[1] > > X_gram = seq_kernel_free(X, self.X_, \ > pri_kernel=kernselect(self.kernel, self.coef0, self.gamma, > self.degree, self.scale), \ > cut_off=cut_off, order=order) > > X_gram = np.nan_to_num(X_gram) > > return self.ord_svc_.predict(X_gram) > > This will run on any dataset just fine. However, it fails the > check_estimator test. Specifically, when trying to raise an error for > malformed input on predict (in check_classifiers_train), it says that a > ValueError is not raised. Yet if I change the order of X and self.X_ in > seq_kernel_free (which computes the [n_samples_train, n_samples_test] Gram > matrix), it passes the check_estimator test yet fails to run the predict > method. > > How do I resolve both issues simultaneously? > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: kernelsqizer.py Type: text/x-python-script Size: 5142 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SeqSVC Check Estimator Test.ipynb Type: application/octet-stream Size: 9321 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: seqsvc_v2.py Type: text/x-python-script Size: 10890 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: timeseriestools.py Type: text/x-python-script Size: 1419 bytes Desc: not available URL: From mcapizzi at email.arizona.edu Thu Aug 17 14:46:50 2017 From: mcapizzi at email.arizona.edu (Michael Capizzi) Date: Thu, 17 Aug 2017 11:46:50 -0700 Subject: [scikit-learn] any interest in incorporating a new Transformer? Message-ID: Hi all - Forgive me if this is the wrong place for posting this question, but I'd like to inquire about the community's interest in incorporating a new Transformer into the code base. This paper ( https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf ) is a "classic" in Natural Language Processing and is often times used as a very competitive baseline. TL;DR it transforms a traditional count-based feature space into the conditional probabilities of a `Naive Bayes` classifier. These transformed features can then be used to train any linear classifier. The paper focuses on `SVM`. The attached notebook has an example of the custom `Transformer` I built along with a custom `Classifier` to utilize this `Transformer` in a `multiclass` case (as the feature space transformation differs depending on the label). If there is interest in the community for the inclusion of this `Transformer` and `Classifier`, I'd happily go through the official process of a `pull-request`, etc. -Michael -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Aug 19 05:47:06 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 19 Aug 2017 19:47:06 +1000 Subject: [scikit-learn] any interest in incorporating a new Transformer? In-Reply-To: References: Message-ID: this is the right place to ask, but I'd be more interested to see a scikit-learn-compatible implementation available, perhaps in scikit-learn-contrib more than to see it part of the main package... On 19 Aug 2017 2:13 am, "Michael Capizzi" wrote: > Hi all - > > Forgive me if this is the wrong place for posting this question, but I'd > like to inquire about the community's interest in incorporating a new > Transformer into the code base. > > This paper ( https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf ) > is a "classic" in Natural Language Processing and is often times used as a > very competitive baseline. TL;DR it transforms a traditional count-based > feature space into the conditional probabilities of a `Naive Bayes` > classifier. These transformed features can then be used to train any > linear classifier. The paper focuses on `SVM`. > > The attached notebook has an example of the custom `Transformer` I built > along with a custom `Classifier` to utilize this `Transformer` in a > `multiclass` case (as the feature space transformation differs depending on > the label). > > If there is interest in the community for the inclusion of this > `Transformer` and `Classifier`, I'd happily go through the official process > of a `pull-request`, etc. > > -Michael > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mcapizzi at email.arizona.edu Sat Aug 19 18:36:21 2017 From: mcapizzi at email.arizona.edu (Michael Capizzi) Date: Sat, 19 Aug 2017 15:36:21 -0700 Subject: [scikit-learn] any interest in incorporating a new Transformer? In-Reply-To: References: Message-ID: Thanks @joel - I wasn?t aware of scikit-learn-contrib. Is this what you?re referring to? https://github.com/scikit-learn-contrib/scikit-learn-contrib If so, I don?t see any existing projects that this would fit into; could I start a new one in a pull-request? -M ? On Sat, Aug 19, 2017 at 2:47 AM, Joel Nothman wrote: > this is the right place to ask, but I'd be more interested to see a > scikit-learn-compatible implementation available, perhaps in > scikit-learn-contrib more than to see it part of the main package... > > On 19 Aug 2017 2:13 am, "Michael Capizzi" > wrote: > >> Hi all - >> >> Forgive me if this is the wrong place for posting this question, but I'd >> like to inquire about the community's interest in incorporating a new >> Transformer into the code base. >> >> This paper ( https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf ) >> is a "classic" in Natural Language Processing and is often times used as a >> very competitive baseline. TL;DR it transforms a traditional count-based >> feature space into the conditional probabilities of a `Naive Bayes` >> classifier. These transformed features can then be used to train any >> linear classifier. The paper focuses on `SVM`. >> >> The attached notebook has an example of the custom `Transformer` I built >> along with a custom `Classifier` to utilize this `Transformer` in a >> `multiclass` case (as the feature space transformation differs depending on >> the label). >> >> If there is interest in the community for the inclusion of this >> `Transformer` and `Classifier`, I'd happily go through the official process >> of a `pull-request`, etc. >> >> -Michael >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Aug 20 08:28:44 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 20 Aug 2017 22:28:44 +1000 Subject: [scikit-learn] any interest in incorporating a new Transformer? In-Reply-To: References: Message-ID: The idea is to take the template ( https://github.com/scikit-learn-contrib/project-template), build, test and document your estimator(s), and offer it to be housed within scikit-learn-contrib. On 20 August 2017 at 08:36, Michael Capizzi wrote: > Thanks @joel - > > I wasn?t aware of scikit-learn-contrib. Is this what you?re referring to? > https://github.com/scikit-learn-contrib/scikit-learn-contrib > > If so, I don?t see any existing projects that this would fit into; could I > start a new one in a pull-request? > > -M > ? > > On Sat, Aug 19, 2017 at 2:47 AM, Joel Nothman > wrote: > >> this is the right place to ask, but I'd be more interested to see a >> scikit-learn-compatible implementation available, perhaps in >> scikit-learn-contrib more than to see it part of the main package... 
>> >> On 19 Aug 2017 2:13 am, "Michael Capizzi" >> wrote: >> >>> Hi all - >>> >>> Forgive me if this is the wrong place for posting this question, but I'd >>> like to inquire about the community's interest in incorporating a new >>> Transformer into the code base. >>> >>> This paper ( https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf ) >>> is a "classic" in Natural Language Processing and is often times used as a >>> very competitive baseline. TL;DR it transforms a traditional count-based >>> feature space into the conditional probabilities of a `Naive Bayes` >>> classifier. These transformed features can then be used to train any >>> linear classifier. The paper focuses on `SVM`. >>> >>> The attached notebook has an example of the custom `Transformer` I built >>> along with a custom `Classifier` to utilize this `Transformer` in a >>> `multiclass` case (as the feature space transformation differs depending on >>> the label). >>> >>> If there is interest in the community for the inclusion of this >>> `Transformer` and `Classifier`, I'd happily go through the official process >>> of a `pull-request`, etc. >>> >>> -Michael >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Mon Aug 21 10:11:25 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Mon, 21 Aug 2017 17:11:25 +0300 Subject: [scikit-learn] How can i write the birch prediction results to the file Message-ID: Dear scikit-learn developers, I have a text file where the columns represent the 22 features and the rows represent the amino asid . (you can see in the attachment) I want to apply hierarchical clustering to this database usign *sklearn.cluster.Birch algorithm.* There are too many prediction results and it is not possible to see them on the screen. How can i write the birch prediction results to the file? I would appreciate if you could advise on some methods. Thanks. *Birch Codes:* from sklearn.cluster import Birch import numpy as np X=np.loadtxt(open("C:\class1.txt", "rb"), delimiter=";") brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,compute_labels=True,copy=True) brc.fit(X) centroids = brc.subcluster_centers_ labels = brc.subcluster_labels_ n_clusters = np.unique(labels).size brc.predict(X) print("\n brc.predict(X)") print(brc.predict(X)) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 
-------------- next part --------------
[Attachment: class1.txt -- the semicolon-delimited data file referenced above,
with one row per amino acid and 22 feature columns per row, for example:

0.1877;0.386705;0.242412;0.175513;0.109395;0.5;0;0.244492;0.25402;0.501485;0.08978;0.30011;0.610105;0.399205;0.20793;0.392865;0.183304;0.450634;0.181905;0.119275;0.441045;0.15519

The remaining rows are omitted from this archive.]
From nicholdav at gmail.com  Mon Aug 21 10:38:57 2017
From: nicholdav at gmail.com (David Nicholson)
Date: Mon, 21 Aug 2017 10:38:57 -0400
Subject: [scikit-learn] How can i write the birch prediction results to the file
In-Reply-To: 
References: 
Message-ID: 

Hi Sema,

You can save using pickle from the Python standard library, or using the
joblib library which is a dependency of sklearn (so you have it already).
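For the pickle route, a minimal sketch would look like this (using the brc
and X from your script; the file name is just an example):

    import pickle

    birch_predict = brc.predict(X)
    with open('predictions.pkl', 'wb') as f:
        pickle.dump(birch_predict, f)

    # later, load the predictions back into memory
    with open('predictions.pkl', 'rb') as f:
        birch_predict = pickle.load(f)
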
The sklearn docs show examples of saving models, but it will work for your
prediction results too:
http://scikit-learn.org/stable/modules/model_persistence.html

You'd just do something like:

import joblib
...
# your code here
...
birch_predict = brc.predict(X)
filename = 'predictions'
joblib.dump(birch_predict, filename)

And you can get the values back into memory with joblib.load

Hth
--David (list lurker)

On Aug 21, 2017 10:13, "Sema Atasever" wrote:

Dear scikit-learn developers,

I have a text file where the columns represent the 22 features and the rows
represent the amino acids (you can see it in the attachment).

I want to apply hierarchical clustering to this database using the
*sklearn.cluster.Birch* algorithm.

There are too many prediction results and it is not possible to see them on
the screen. How can I write the Birch prediction results to a file?

I would appreciate it if you could advise on some methods.
Thanks.

*Birch Codes:*
from sklearn.cluster import Birch
import numpy as np

X = np.loadtxt(open("C:\class1.txt", "rb"), delimiter=";")

brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
            compute_labels=True, copy=True)
brc.fit(X)

centroids = brc.subcluster_centers_
labels = brc.subcluster_labels_
n_clusters = np.unique(labels).size
brc.predict(X)

print("\n brc.predict(X)")
print(brc.predict(X))

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nicholdav at gmail.com  Mon Aug 21 10:41:51 2017
From: nicholdav at gmail.com (David Nicholson)
Date: Mon, 21 Aug 2017 10:41:51 -0400
Subject: [scikit-learn] How can i write the birch prediction results to the file
In-Reply-To: 
References: 
Message-ID: 

Ack, should've mentioned you can do:
from sklearn.externals import joblib
since it is a sklearn dependency. That way you won't need to install joblib
separately.

On Aug 21, 2017 10:38, "David Nicholson" wrote:

> Hi Sema,
>
> You can save using pickle from the Python standard library, or using the
> joblib library which is a dependency of sklearn (so you have it already).
>
> The sklearn docs show examples of saving models, but it will work for your
> prediction results too:
> http://scikit-learn.org/stable/modules/model_persistence.html
>
> You'd just do something like:
> import joblib
> ...
> # your code here
> ...
> birch_predict = brc.predict(X)
> filename = 'predictions'
> joblib.dump(birch_predict, filename)
>
> And you can get the values back into memory with joblib.load
>
> Hth
> --David (list lurker)
>
> On Aug 21, 2017 10:13, "Sema Atasever" wrote:
>
> Dear scikit-learn developers,
>
> I have a text file where the columns represent the 22 features and the
> rows represent the amino acids (you can see it in the attachment).
>
> I want to apply hierarchical clustering to this database using the
> *sklearn.cluster.Birch* algorithm.
>
> There are too many prediction results and it is not possible to see them
> on the screen.
> How can I write the Birch prediction results to a file?
>
> I would appreciate it if you could advise on some methods.
> Thanks.
> > *Birch Codes:* > from sklearn.cluster import Birch > import numpy as np > > X=np.loadtxt(open("C:\class1.txt", "rb"), delimiter=";") > > brc = Birch(branching_factor=50, n_clusters=None, > threshold=0.5,compute_labels=True,copy=True) > > brc.fit(X) > > centroids = brc.subcluster_centers_ > > labels = brc.subcluster_labels_ > n_clusters = np.unique(labels).size > brc.predict(X) > > print("\n brc.predict(X)") > print(brc.predict(X)) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Tue Aug 22 04:24:26 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Tue, 22 Aug 2017 11:24:26 +0300 Subject: [scikit-learn] How can i write the birch prediction results to the file In-Reply-To: References: Message-ID: Dear David, "joblib.dump" produces a file format with npy extension so I can not open the file with the notepad editor. I can not see the predictions results inside the file. Is there another way to save the prediction results in text format? Thank you for your help. On Mon, Aug 21, 2017 at 5:38 PM, David Nicholson wrote: > Hi Sema, > > You can save using pickle from the Python standard library, or using the > joblib library which is a dependency of sklearn (so you have it already). > > The sklearn docs show examples of saving models but it will work for your > predict results too: > http://scikit-learn.org/stable/modules/model_persistence.html > > You'd just do something like: > import joblib > ... > # your code here > ... > birch_predict = brc.predict(X) > filename = 'predictions' > joblib.dump(birch_predict, filename) > > And you can get the values back into memory with joblib.load > > Hth > --David (list lurker) > > On Aug 21, 2017 10:13, "Sema Atasever" wrote: > > Dear scikit-learn developers, > > I have a text file where the columns represent the 22 features and the > rows represent the amino asid . (you can see in the attachment) > > > I want to apply hierarchical clustering to this database usign *sklearn.cluster.Birch > algorithm.* > > There are too many prediction results and it is not possible to see them > on the screen. > How can i write the birch prediction results to the file? > > I would appreciate if you could advise on some methods. > Thanks. > > *Birch Codes:* > from sklearn.cluster import Birch > import numpy as np > > X=np.loadtxt(open("C:\class1.txt", "rb"), delimiter=";") > > brc = Birch(branching_factor=50, n_clusters=None, > threshold=0.5,compute_labels=True,copy=True) > > brc.fit(X) > > centroids = brc.subcluster_centers_ > > labels = brc.subcluster_labels_ > n_clusters = np.unique(labels).size > brc.predict(X) > > print("\n brc.predict(X)") > print(brc.predict(X)) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rth.yurchak at gmail.com Tue Aug 22 04:33:21 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 22 Aug 2017 11:33:21 +0300 Subject: [scikit-learn] How can i write the birch prediction results to the file In-Reply-To: References: Message-ID: <433d5f58-4e2f-661e-6e54-bdd388a4c1ce@gmail.com> Hello Sema, On 22/08/17 11:24, Sema Atasever wrote: > "joblib.dump" produces a file format with npy extension so I can not open the file with the notepad editor. I can not see the predictions results inside the file. > > Is there another way to save the prediction results in text format? Prediction results are just an array: you could use numpy.savetxt to save them in an ascii text format. -- Roman From aliozcan at gmail.com Tue Aug 22 04:37:36 2017 From: aliozcan at gmail.com (Ali Ozcan) Date: Tue, 22 Aug 2017 10:37:36 +0200 Subject: [scikit-learn] How can i write the birch prediction results to the file In-Reply-To: <433d5f58-4e2f-661e-6e54-bdd388a4c1ce@gmail.com> References: <433d5f58-4e2f-661e-6e54-bdd388a4c1ce@gmail.com> Message-ID: Sema, you can use this import numpy as np np.savetxt('birch_predict.csv', birch_predict, delimiter=',') On Tue, Aug 22, 2017 at 10:33 AM, Roman Yurchak wrote: > Hello Sema, > > On 22/08/17 11:24, Sema Atasever wrote: > > "joblib.dump" produces a file format with npy extension so I can not > open the file with the notepad editor. I can not see the predictions > results inside the file. > >> >> Is there another way to save the prediction results in text format? >> > > Prediction results are just an array: you could use numpy.savetxt to save > them in an ascii text format. > > -- > Roman > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Ali Ozcan -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Tue Aug 22 04:48:59 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Tue, 22 Aug 2017 11:48:59 +0300 Subject: [scikit-learn] How can i write the birch prediction results to the file In-Reply-To: <433d5f58-4e2f-661e-6e54-bdd388a4c1ce@gmail.com> References: <433d5f58-4e2f-661e-6e54-bdd388a4c1ce@gmail.com> Message-ID: Dear Roman and Ali, it did worked thanks for all your help. Regards. On Tue, Aug 22, 2017 at 11:33 AM, Roman Yurchak wrote: > Hello Sema, > > On 22/08/17 11:24, Sema Atasever wrote: > > "joblib.dump" produces a file format with npy extension so I can not > open the file with the notepad editor. I can not see the predictions > results inside the file. > >> >> Is there another way to save the prediction results in text format? >> > > Prediction results are just an array: you could use numpy.savetxt to save > them in an ascii text format. > > -- > Roman > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcapizzi at email.arizona.edu Tue Aug 22 14:52:04 2017 From: mcapizzi at email.arizona.edu (Michael Capizzi) Date: Tue, 22 Aug 2017 11:52:04 -0700 Subject: [scikit-learn] any interest in incorporating a new Transformer? In-Reply-To: References: Message-ID: Thanks @joel, for the guidance. I will get right on it, and hopefully have something for public consumption soon! 
-M On Sun, Aug 20, 2017 at 5:28 AM, Joel Nothman wrote: > The idea is to take the template (https://github.com/scikit- > learn-contrib/project-template), build, test and document your > estimator(s), and offer it to be housed within scikit-learn-contrib. > > On 20 August 2017 at 08:36, Michael Capizzi > wrote: > >> Thanks @joel - >> >> I wasn?t aware of scikit-learn-contrib. Is this what you?re referring >> to? https://github.com/scikit-learn-contrib/scikit-learn-contrib >> >> If so, I don?t see any existing projects that this would fit into; could >> I start a new one in a pull-request? >> >> -M >> ? >> >> On Sat, Aug 19, 2017 at 2:47 AM, Joel Nothman >> wrote: >> >>> this is the right place to ask, but I'd be more interested to see a >>> scikit-learn-compatible implementation available, perhaps in >>> scikit-learn-contrib more than to see it part of the main package... >>> >>> On 19 Aug 2017 2:13 am, "Michael Capizzi" >>> wrote: >>> >>>> Hi all - >>>> >>>> Forgive me if this is the wrong place for posting this question, but >>>> I'd like to inquire about the community's interest in incorporating a new >>>> Transformer into the code base. >>>> >>>> This paper ( https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf ) >>>> is a "classic" in Natural Language Processing and is often times used as a >>>> very competitive baseline. TL;DR it transforms a traditional count-based >>>> feature space into the conditional probabilities of a `Naive Bayes` >>>> classifier. These transformed features can then be used to train any >>>> linear classifier. The paper focuses on `SVM`. >>>> >>>> The attached notebook has an example of the custom `Transformer` I >>>> built along with a custom `Classifier` to utilize this `Transformer` in a >>>> `multiclass` case (as the feature space transformation differs depending on >>>> the label). >>>> >>>> If there is interest in the community for the inclusion of this >>>> `Transformer` and `Classifier`, I'd happily go through the official process >>>> of a `pull-request`, etc. >>>> >>>> -Michael >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Wed Aug 23 04:55:21 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Wed, 23 Aug 2017 11:55:21 +0300 Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch Message-ID: Dear scikit-learn members, Considering the "CF-tree" data structure : - How can i *access Clustering Feature Tree* in Birch? - For example, how many clusters are there in the hierarchy under the *root node* and what are the data samples in this cluster? - Can I get them separately for 3 trees? Best. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 
-------------- next part --------------
from sklearn.cluster import Birch
from sklearn.externals import joblib
import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt(open("C:\dataset.txt", "rb"), delimiter=";")

brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
            compute_labels=True, copy=True)
brc.fit(X)

birch_predict = brc.predict(X)

print("\nClustering_result:\n")
print(birch_predict)

np.savetxt('birch_predict_CLASS_0.csv', birch_predict, fmt="%i", delimiter=',')
-------------- next part --------------
[dataset.txt attachment: rows of 22 semicolon-separated feature values;
omitted here for readability]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CF-tree data structure.jpg
Type: image/jpeg
Size: 54025 bytes
Desc: not available
URL: 

From msuzen at gmail.com  Wed Aug 23 06:44:58 2017
From: msuzen at gmail.com (Suzen, Mehmet)
Date: Wed, 23 Aug 2017 12:44:58 +0200
Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch
In-Reply-To: 
References: 
Message-ID: 

Hi Sema,

You can access the CFNode from the fit output; assign the fit output so
you have the object.

brc_fit = brc.fit(X)
brc_fit_cfnode = brc_fit.root_

Then you can access the CFNode; see here:
https://kite.com/docs/python/sklearn.cluster.birch._CFNode

Also, this example compares Birch with mini-batch KMeans:
http://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html

Hope this was what you are after.

Best,
Mehmet

On 23 August 2017 at 10:55, Sema Atasever wrote:
> Dear scikit-learn members,
>
> Considering the "CF-tree" data structure :
>
> - How can i access Clustering Feature Tree in Birch?
>
> - For example, how many clusters are there in the hierarchy under the root
> node and what are the data samples in this cluster?
>
> - Can I get them separately for 3 trees?
>
> Best.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
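[Note: a minimal sketch of walking the CF tree discussed above, added for
readers of the archive. It assumes a Birch instance fitted as in the attached
script (called brc here) and relies on the private attributes root_,
subclusters_ and child_ of the CF nodes, which are not a stable public API
and may change between scikit-learn versions.]

from sklearn.cluster import Birch
import numpy as np

X = np.loadtxt(open("C:\dataset.txt", "rb"), delimiter=";")  # as in the attached script
brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5)
brc.fit(X)

def count_subclusters(node, depth=0):
    # Print how many subclusters hang directly below this CF node,
    # then recurse into the child nodes of the internal subclusters.
    print("  " * depth + "depth {}: {} subclusters".format(depth, len(node.subclusters_)))
    for subcluster in node.subclusters_:
        if subcluster.child_ is not None:  # leaf subclusters carry no child node
            count_subclusters(subcluster.child_, depth + 1)

count_subclusters(brc.root_)  # root_ is the top of the CF tree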
From rth.yurchak at gmail.com  Wed Aug 23 07:28:16 2017
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Wed, 23 Aug 2017 14:28:16 +0300
Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch
In-Reply-To: 
References: 
Message-ID: <433d5f58-4e2f-661e-6e54-bdd388a4c1ce@gmail.com>

> what are the data samples in this cluster

Mehmet's response below works for exploring the hierarchical tree. However,
Birch currently doesn't store the data samples that belong to a given
subcluster. If you need that, as far as I know, a reasonable approximation
can be obtained by computing the data samples that are closest to the
centroid of the considered subcluster (accessible via _CFNode.centroids_),
as compared to all other subcluster centroids at this hierarchical tree
depth.

Alternatively, the modifications in PR
https://github.com/scikit-learn/scikit-learn/pull/8808 aimed to make this
process easier.

--
Roman

On 23/08/17 13:44, Suzen, Mehmet wrote:
> Hi Sema,
>
> You can access the CFNode from the fit output; assign the fit output so
> you have the object.
>
> brc_fit = brc.fit(X)
> brc_fit_cfnode = brc_fit.root_
>
> Then you can access the CFNode; see here:
> https://kite.com/docs/python/sklearn.cluster.birch._CFNode
>
> Also, this example compares Birch with mini-batch KMeans:
> http://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html
>
> Hope this was what you are after.
>
> Best,
> Mehmet
>
> On 23 August 2017 at 10:55, Sema Atasever wrote:
>> Dear scikit-learn members,
>>
>> Considering the "CF-tree" data structure :
>>
>> - How can i access Clustering Feature Tree in Birch?
>>
>> - For example, how many clusters are there in the hierarchy under the root
>> node and what are the data samples in this cluster?
>>
>> - Can I get them separately for 3 trees?
>>
>> Best.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
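[Note: a rough sketch of the approximation Roman describes above, assigning
every sample to its nearest subcluster centroid at one level of the tree. As
before, brc and X are assumed to come from the attached script, and
centroids_ is a private attribute of the CF nodes, so this is illustrative
rather than a supported API.]

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# centroids of the subclusters sitting directly under the root (one tree level)
root_centroids = brc.root_.centroids_

# index of the nearest root-level subcluster for every sample in X
nearest = pairwise_distances_argmin(X, root_centroids)

# approximate membership of, for example, the first root-level subcluster
members_of_first = X[nearest == 0]
print(members_of_first.shape)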
From g.lemaitre58 at gmail.com  Thu Aug 24 20:14:08 2017
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Fri, 25 Aug 2017 02:14:08 +0200
Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0
Message-ID: 

We are excited to announce the new release of the scikit-learn-contrib
imbalanced-learn, already available through conda and pip (cf. the
installation page https://tinyurl.com/y92flbab for more info)

Notable add-ons are:

* Support of sparse matrices
* Support of multi-class resampling for all methods
* A new BalancedBaggingClassifier using random under-sampling chained with
the scikit-learn BaggingClassifier
* Creation of a didactic user guide
* New API of the ratio parameter to fit the needs of multi-class resampling
* Migration from nosetests to pytest

You can check the full changelog at:
http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html#version-0-3

A big thank you to contributors to use, raise issues, and submit PRs to
imblearn.
--
Guillaume Lemaitre
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From francois.dion at gmail.com  Thu Aug 24 20:31:08 2017
From: francois.dion at gmail.com (Francois Dion)
Date: Thu, 24 Aug 2017 20:31:08 -0400
Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0
In-Reply-To: 
References: 
Message-ID: <20170825003108.5587025.54506.170474@gmail.com>

An HTML attachment was scrubbed...
URL: 

From ashimb9 at gmail.com  Thu Aug 24 21:48:07 2017
From: ashimb9 at gmail.com (Ashim Bhattarai)
Date: Thu, 24 Aug 2017 20:48:07 -0500
Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0
In-Reply-To: <20170825003108.5587025.54506.170474@gmail.com>
References: <20170825003108.5587025.54506.170474@gmail.com>
Message-ID: 

Thanks a lot!

On Thu, Aug 24, 2017 at 7:31 PM, Francois Dion wrote:

> Thank you very much. Really useful. By the way, I introduced your module
> at the data intelligence conference at Capital One less than two months
> ago (in the Washington, DC suburbs).
>
> Sent from my BlackBerry 10 Darkphone
> *From: *Guillaume Lemaître
> *Sent: *Thursday, August 24, 2017 20:15
> *To: *Scikit-learn user and developer mailing list
> *Reply To: *Scikit-learn mailing list
> *Subject: *[scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn
> 0.19.0
>
> We are excited to announce the new release of the scikit-learn-contrib
> imbalanced-learn, already available through conda and pip (cf. the
> installation page https://tinyurl.com/y92flbab for more info)
>
> Notable add-ons are:
>
> * Support of sparse matrices
> * Support of multi-class resampling for all methods
> * A new BalancedBaggingClassifier using random under-sampling chained with
> the scikit-learn BaggingClassifier
> * Creation of a didactic user guide
> * New API of the ratio parameter to fit the needs of multi-class resampling
> * Migration from nosetests to pytest
>
> You can check the full changelog at:
> http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html#version-0-3
>
> A big thank you to contributors to use, raise issues, and submit PRs to
> imblearn.
> --
> Guillaume Lemaitre
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com  Fri Aug 25 00:13:26 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 25 Aug 2017 14:13:26 +1000
Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0
In-Reply-To: 
References: 
Message-ID: 

Congratulations Guillaume and the imblearn team!

On 25 August 2017 at 10:14, Guillaume Lemaître wrote:

> We are excited to announce the new release of the scikit-learn-contrib
> imbalanced-learn, already available through conda and pip (cf. the
> installation page https://tinyurl.com/y92flbab for more info)
>
> Notable add-ons are:
>
> * Support of sparse matrices
> * Support of multi-class resampling for all methods
> * A new BalancedBaggingClassifier using random under-sampling chained with
> the scikit-learn BaggingClassifier
> * Creation of a didactic user guide
> * New API of the ratio parameter to fit the needs of multi-class resampling
> * Migration from nosetests to pytest
>
> You can check the full changelog at:
> http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html#version-0-3
>
> A big thank you to contributors to use, raise issues, and submit PRs to
> imblearn.
> --
> Guillaume Lemaitre
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org  Fri Aug 25 01:52:01 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 25 Aug 2017 07:52:01 +0200
Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0
In-Reply-To: 
References: 
Message-ID: <20170825055201.GQ153900@phare.normalesup.org>

Indeed, congratulations for the release!

Gaël

On Fri, Aug 25, 2017 at 02:13:26PM +1000, Joel Nothman wrote:
> Congratulations Guillaume and the imblearn team!
> On 25 August 2017 at 10:14, Guillaume Lema?tre wrote: > We are excited to announce the new release of the scikit-learn-contrib > imbalanced-learn, already available through conda and pip (cf. the > installation page https://tinyurl.com/y92flbab for more info) > Notable add-ons are: > * Support of sparse matrices > * Support of multi-class resampling for all methods > * A new BalancedBaggingClassifier using random under-sampling chained with > the scikit-learn BaggingClassifier > * Creation of a didactic user guide > * New API of the ratio parameter to fit the needs of multi-class resampling > * Migration from nosetests to pytest > You can check the full changelog at: > http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html# > version-0-3 > A big thank you to contributors to use, raise issues, and submit PRs to > imblearn. -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From bertrand.thirion at inria.fr Fri Aug 25 01:56:28 2017 From: bertrand.thirion at inria.fr (bthirion) Date: Fri, 25 Aug 2017 07:56:28 +0200 Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0 In-Reply-To: <20170825055201.GQ153900@phare.normalesup.org> References: <20170825055201.GQ153900@phare.normalesup.org> Message-ID: <71818780-5165-7545-3eea-1e398780be9a@inria.fr> +1 B On 25/08/2017 07:52, Gael Varoquaux wrote: > Indeed, congratulations for the release! > > Ga?l > > On Fri, Aug 25, 2017 at 02:13:26PM +1000, Joel Nothman wrote: >> Congratulations Guillaume and the imblearn team! >> On 25 August 2017 at 10:14, Guillaume Lema?tre wrote: >> We are excited to announce the new release of the scikit-learn-contrib >> imbalanced-learn, already available through conda and pip (cf. the >> installation page https://tinyurl.com/y92flbab for more info) >> Notable add-ons are: >> * Support of sparse matrices >> * Support of multi-class resampling for all methods >> * A new BalancedBaggingClassifier using random under-sampling chained with >> the scikit-learn BaggingClassifier >> * Creation of a didactic user guide >> * New API of the ratio parameter to fit the needs of multi-class resampling >> * Migration from nosetests to pytest >> You can check the full changelog at: >> http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html# >> version-0-3 >> A big thank you to contributors to use, raise issues, and submit PRs to >> imblearn. From se.raschka at gmail.com Fri Aug 25 02:18:49 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 25 Aug 2017 02:18:49 -0400 Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0 In-Reply-To: References: Message-ID: <07BF3B06-5970-42B6-9148-0D604FE5921F@gmail.com> Just read through the summary of the new features and browsed through the user guide. The guide is really well structured and easy to navigate, thanks for putting all the work into it. Overall, thanks for this great contribution and new version :) Best, Sebastian > On Aug 24, 2017, at 8:14 PM, Guillaume Lema?tre wrote: > > We are excited to announce the new release of the scikit-learn-contrib imbalanced-learn, already available through conda and pip (cf. 
the installation page https://tinyurl.com/y92flbab for more info) > > Notable add-ons are: > > * Support of sparse matrices > * Support of multi-class resampling for all methods > * A new BalancedBaggingClassifier using random under-sampling chained with the scikit-learn BaggingClassifier > * Creation of a didactic user guide > * New API of the ratio parameter to fit the needs of multi-class resampling > * Migration from nosetests to pytest > > You can check the full changelog at: > http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html#version-0-3 > > A big thank you to contributors to use, raise issues, and submit PRs to imblearn. > -- > Guillaume Lemaitre > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jaquesgrobler at gmail.com Fri Aug 25 04:53:33 2017 From: jaquesgrobler at gmail.com (Jaques Grobler) Date: Fri, 25 Aug 2017 10:53:33 +0200 Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0 In-Reply-To: <07BF3B06-5970-42B6-9148-0D604FE5921F@gmail.com> References: <07BF3B06-5970-42B6-9148-0D604FE5921F@gmail.com> Message-ID: Congrats guys! 2017-08-25 8:18 GMT+02:00 Sebastian Raschka : > Just read through the summary of the new features and browsed through the > user guide. The guide is really well structured and easy to navigate, > thanks for putting all the work into it. Overall, thanks for this great > contribution and new version :) > > Best, > Sebastian > > > On Aug 24, 2017, at 8:14 PM, Guillaume Lema?tre > wrote: > > > > We are excited to announce the new release of the scikit-learn-contrib > imbalanced-learn, already available through conda and pip (cf. the > installation page https://tinyurl.com/y92flbab for more info) > > > > Notable add-ons are: > > > > * Support of sparse matrices > > * Support of multi-class resampling for all methods > > * A new BalancedBaggingClassifier using random under-sampling chained > with the scikit-learn BaggingClassifier > > * Creation of a didactic user guide > > * New API of the ratio parameter to fit the needs of multi-class > resampling > > * Migration from nosetests to pytest > > > > You can check the full changelog at: > > http://contrib.scikit-learn.org/imbalanced-learn/stable/ > whats_new.html#version-0-3 > > > > A big thank you to contributors to use, raise issues, and submit PRs to > imblearn. > > -- > > Guillaume Lemaitre > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Fri Aug 25 05:09:37 2017 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Fri, 25 Aug 2017 11:09:37 +0200 Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0 In-Reply-To: References: <07BF3B06-5970-42B6-9148-0D604FE5921F@gmail.com> Message-ID: In drug discovery, if you are lucky you might get hit compounds 10% of the time. So if you do ML-based drug discovery, your datasets are strongly imbalanced. It seems the imbalanced package would be perfect for this area. J.B. 2017-08-25 10:53 GMT+02:00 Jaques Grobler : > Congrats guys! 
> > 2017-08-25 8:18 GMT+02:00 Sebastian Raschka : > >> Just read through the summary of the new features and browsed through the >> user guide. The guide is really well structured and easy to navigate, >> thanks for putting all the work into it. Overall, thanks for this great >> contribution and new version :) >> >> Best, >> Sebastian >> >> > On Aug 24, 2017, at 8:14 PM, Guillaume Lema?tre >> wrote: >> > >> > We are excited to announce the new release of the scikit-learn-contrib >> imbalanced-learn, already available through conda and pip (cf. the >> installation page https://tinyurl.com/y92flbab for more info) >> > >> > Notable add-ons are: >> > >> > * Support of sparse matrices >> > * Support of multi-class resampling for all methods >> > * A new BalancedBaggingClassifier using random under-sampling chained >> with the scikit-learn BaggingClassifier >> > * Creation of a didactic user guide >> > * New API of the ratio parameter to fit the needs of multi-class >> resampling >> > * Migration from nosetests to pytest >> > >> > You can check the full changelog at: >> > http://contrib.scikit-learn.org/imbalanced-learn/stable/what >> s_new.html#version-0-3 >> > >> > A big thank you to contributors to use, raise issues, and submit PRs to >> imblearn. >> > -- >> > Guillaume Lemaitre >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mario.michael.krell at gmail.com Fri Aug 25 11:40:35 2017 From: mario.michael.krell at gmail.com (Dr. Mario Michael Krell) Date: Fri, 25 Aug 2017 08:40:35 -0700 Subject: [scikit-learn] L-BFGS in MLPClassifier Message-ID: <05E76DEC-1286-4A87-BA27-B48F786BD89C@gmail.com> To whoever programmed the MLPClassifier (with the L-BFGS solver), I just wanted to personally thank you and if I get your name(s), I would mention it/them in my paper additionally to the mandatory sklearn citation. I hope that sklearn will be keeping this algorithm forever in their library despite the increasing amount of established deep learning libraries that seem to make this code obsolete. For my small scale, more theoretic analysis, it worked much better than any other algorithm and I would not have gotten such surprising results. Due to the high quality implementation, the integration of a much better solver than SGD, and the respective good documentation, I could show empirically how the VC dimension and another property of MLPs (MacKay dimension) actually scale linear with the number of edges in the respective graph which helped us to provide a new much more strict upper bound (https://arxiv.org/abs/1708.06019 ). This would have not been possible with other implementations. If there is an interest by the developers, I could try to contribute a tutorial documentation for sklearn. Just let me know. Thank you a lot!!! Best, Mario -------------- next part -------------- An HTML attachment was scrubbed... 
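A minimal sketch of the usage discussed above, for readers landing on this thread later: the L-BFGS solver is selected through the solver argument of MLPClassifier. The dataset, layer sizes and other parameter values below are made up for illustration and are not taken from the original post.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Small synthetic problem; full-batch L-BFGS is practical at this scale
    # and avoids the learning-rate tuning needed by SGD-type solvers.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(10,), alpha=1e-4,
                        random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))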
URL: From soneill5045 at gmail.com Fri Aug 25 19:30:52 2017 From: soneill5045 at gmail.com (Stephen O'Neill) Date: Fri, 25 Aug 2017 19:30:52 -0400 Subject: [scikit-learn] different sized inputs in call to custom metric in KNN Message-ID: Hey Gang, I was wondering if anyone might be able to answer a question about the sklearn.neighbors.NearestNeighbors class. For reference, I'm on: anaconda distribution python 2.7.11 sklearn version 0.17.1 I'm subclassing the NearestNeighbors class and using a custom distance metric, something like the following:

    class MyModel(NearestNeighbors):
        def __init__(self, some_info):
            def custom_dist(x, y, info=some_info):
                return numpy.sum(numpy.abs(x - y) / some_info)  # returns a scalar value
            NearestNeighbors.__init__(self, metric=custom_dist)

So I build a dummy dataset based on some Gaussians of shape (5000,3), then later when I call MyModel.fit() I get the error: "ValueError: operands could not be broadcast together with shapes (10,) (3,)" inside of my custom_dist function. Naturally I checked with some simple print x, print y statements inside of custom_dist, and sure enough the shapes of x and y are both (10,), whereas I am expecting them to be of shape (3,) since my dummy data has 3 columns. (Note the actual custom_dist function written above is not what I'm truly using, but it does reproduce the same ValueError). When I change my NearestNeighbors.__init__ call to (self, algorithm='brute') instead of the default algorithm='auto', the x and y values that get passed to my custom_dist are shape (3,) like I would expect. What is different in the distance metric between the algorithm='auto' and algorithm='brute' cases that would transform a 3-dimensional sample into a 10-dimensional sample? Do the KDTree and/or BallTree classes use the distance metric on tree nodes or something too? I wasn't able to figure out where the shape (10,) x and y samples could be coming from. Thanks in advance! Best, Steve O'Neill -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathieu at mblondel.org Fri Aug 25 21:33:06 2017 From: mathieu at mblondel.org (Mathieu Blondel) Date: Sat, 26 Aug 2017 10:33:06 +0900 Subject: [scikit-learn] L-BFGS in MLPClassifier In-Reply-To: <05E76DEC-1286-4A87-BA27-B48F786BD89C@gmail.com> References: <05E76DEC-1286-4A87-BA27-B48F786BD89C@gmail.com> Message-ID: Thanks for this email. It is always nice to hear about success stories. I assume the guilty party is Issam Laradji, as you can see from his Google Summer of Code blog post: http://issamlaradji.blogspot.jp/2014/06/week-3-gsoc-2014-extending-neural.html L-BFGS is indeed usually a good default choice for medium-scale datasets. It doesn't require any step size tuning and I found recently that it works well for poorly conditioned problems. You can also see a blog post by Nicolas Le Roux praising L-BFGS here: http://labs.criteo.com/2014/09/poh-part-3-distributed-optimization/ Mathieu On Sat, Aug 26, 2017 at 12:40 AM, Dr. Mario Michael Krell < mario.michael.krell at gmail.com> wrote: > To whoever programmed the MLPClassifier (with the L-BFGS solver), > > I just wanted to personally thank you and if I get your name(s), I would > mention it/them in my paper additionally to the mandatory sklearn citation. > > I hope that sklearn will be keeping this algorithm forever in their > library despite the increasing amount of established deep learning > libraries that seem to make this code obsolete.
For my small scale, more > theoretic analysis, it worked much better than any other algorithm and I > would not have gotten such surprising results. Due to the high quality > implementation, the integration of a much better solver than SGD, and the > respective good documentation, I could show empirically how the VC > dimension and another property of MLPs (MacKay dimension) actually scale > linear with the number of edges in the respective graph which helped us to > provide a new much more strict upper bound (https://arxiv.org/abs/1708. > 06019). This would have not been possible with other implementations. If > there is an interest by the developers, I could try to contribute a > tutorial documentation for sklearn. Just let me know. > > Thank you a lot!!! > > Best, > > Mario > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shabieiqbal at gmail.com Sun Aug 27 14:48:42 2017 From: shabieiqbal at gmail.com (Shabie Iqbal) Date: Sun, 27 Aug 2017 20:48:42 +0200 Subject: [scikit-learn] Clarification regarding SGDClassifier Message-ID: <59a31409.c35c1c0a.5c153.9b50@mx.google.com> Dear all, could anyone of you kindly answer this question I posted on Stackoverflow: https://stackoverflow.com/questions/45900330/sklearn-sgdclassifiers-decision-function-odd-behavior Best, Shabie -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdslater at gmail.com Sun Aug 27 18:21:38 2017 From: rdslater at gmail.com (Robert Slater) Date: Sun, 27 Aug 2017 17:21:38 -0500 Subject: [scikit-learn] Clarification regarding SGDClassifier In-Reply-To: <59a31409.c35c1c0a.5c153.9b50@mx.google.com> References: <59a31409.c35c1c0a.5c153.9b50@mx.google.com> Message-ID: I think your use of "X" in X_mod should be "X_train" instead, as X_mod is currently using unshuffled indices while X_check uses the shuffled indices. Virus-free. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> On Sun, Aug 27, 2017 at 1:48 PM, Shabie Iqbal wrote: > Dear all, > > > > could anyone of you kindly answer this question I posted on Stackoverflow: > > > > https://stackoverflow.com/questions/45900330/sklearn- > sgdclassifiers-decision-function-odd-behavior > > > > Best, > > Shabie > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Sun Aug 27 18:23:45 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Sun, 27 Aug 2017 15:23:45 -0700 Subject: [scikit-learn] Decision stubs? Message-ID: Is it possible to efficiently get at the branch statistics that decision tree algorithms iterate over in scikit? For example if the root population has the class counts in the output vector: c0: 5000 c1: 500 Then I'd like to iterate over: # For a boolean (2 valued category) f1=True: c0=3000, c1=450 f1=False: c0=300, c1=30 f1=Null: c0=1700, c1=20 # ? Is considered? # For a continuous value f2<10: c0= ... c1= ... f2>=10: c0= ... c1= ... f2<22: c0= ... c1= ... f2>=22: c0= ... c1= ... I'd like to experiment with building models on-demand for each input row in a predict. To work efficiently, I'd like to reduce the training set to the 'most significant' sub-space(s) using the population statistics. 
I can do it in pandas, although its fairly inefficient to iterate over each feature column many times. Thanks, - Stu From shabieiqbal at gmail.com Sun Aug 27 18:41:01 2017 From: shabieiqbal at gmail.com (Shabie Iqbal) Date: Mon, 28 Aug 2017 00:41:01 +0200 Subject: [scikit-learn] Clarification regarding SGDClassifier In-Reply-To: References: <59a31409.c35c1c0a.5c153.9b50@mx.google.com> Message-ID: <59a34a7d.53e81c0a.de71a.06b0@mx.google.com> Got it? Thanks! Sent from Mail for Windows 10 From: Robert Slater Sent: Monday, August 28, 2017 12:24 AM To: Scikit-learn mailing list Subject: Re: [scikit-learn] Clarification regarding SGDClassifier I think your use of "X" in X_mod should be "X_train" instead, as X_mod is currently using unshuffled indices while X_check uses the shuffled indices. Virus-free. www.avast.com On Sun, Aug 27, 2017 at 1:48 PM, Shabie Iqbal wrote: Dear all, ? could anyone of you kindly answer this question I posted on Stackoverflow: ? https://stackoverflow.com/questions/45900330/sklearn-sgdclassifiers-decision-function-odd-behavior ? Best, Shabie _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Sun Aug 27 23:20:09 2017 From: raga.markely at gmail.com (Raga Markely) Date: Sun, 27 Aug 2017 23:20:09 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline Message-ID: Hello, I am wondering if it's possible to get the weight coefficients of logistic regression from a pipeline? For instance, I have the followings: > clf_lr = LogisticRegression(penalty='l1', C=0.1) > pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) > pipe_lr.fit(X, y) Does pipe_lr have an attribute that I can call to get the weight coefficient? Or do I have to get it from the classifier as follows? > X_std = StandardScaler().fit_transform(X) > clf_lr = LogisticRegression(penalty='l1', C=0.1) > clf_lr.fit(X_std, y) > clf_lr.coef_ Thank you, Raga -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Aug 28 00:01:40 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 28 Aug 2017 14:01:40 +1000 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: No, we do not have a way to get the coefficients with respect to the input (pre-scaling) space. On 28 August 2017 at 13:20, Raga Markely wrote: > Hello, > > I am wondering if it's possible to get the weight coefficients of logistic > regression from a pipeline? > > For instance, I have the followings: > >> clf_lr = LogisticRegression(penalty='l1', C=0.1) >> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >> pipe_lr.fit(X, y) > > > Does pipe_lr have an attribute that I can call to get the weight > coefficient? > > Or do I have to get it from the classifier as follows? > >> X_std = StandardScaler().fit_transform(X) >> clf_lr = LogisticRegression(penalty='l1', C=0.1) >> clf_lr.fit(X_std, y) >> clf_lr.coef_ > > > Thank you, > Raga > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
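For what it's worth, a rough sketch of how the pre-scaling coefficients can be recovered by hand for this particular StandardScaler + LogisticRegression combination (the dataset below is synthetic and only for illustration): the fitted pipeline computes coef . (x - mean_) / scale_ + intercept, so dividing the learned coefficients by scale_ and adjusting the intercept expresses the same decision function in terms of the original features.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    pipe_lr = Pipeline([('sc', StandardScaler()),
                        ('clf', LogisticRegression(penalty='l1', C=0.1,
                                                   solver='liblinear'))])
    pipe_lr.fit(X, y)

    sc = pipe_lr.named_steps['sc']
    clf = pipe_lr.named_steps['clf']

    # Coefficients learned on the standardized features
    coef_std = clf.coef_

    # Same decision function rewritten in terms of the raw, unscaled features
    coef_orig = coef_std / sc.scale_
    intercept_orig = clf.intercept_ - np.dot(coef_orig, sc.mean_)

Note that with an l1 penalty the zeros in coef_ refer to the standardized features; dividing by scale_ keeps the same zeros but changes the non-zero magnitudes.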
URL: From raga.markely at gmail.com Mon Aug 28 00:08:43 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 28 Aug 2017 00:08:43 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: No problem, thank you! Best, Raga On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman wrote: > No, we do not have a way to get the coefficients with respect to the input > (pre-scaling) space. > > On 28 August 2017 at 13:20, Raga Markely wrote: > >> Hello, >> >> I am wondering if it's possible to get the weight coefficients of >> logistic regression from a pipeline? >> >> For instance, I have the followings: >> >>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >>> pipe_lr.fit(X, y) >> >> >> Does pipe_lr have an attribute that I can call to get the weight >> coefficient? >> >> Or do I have to get it from the classifier as follows? >> >>> X_std = StandardScaler().fit_transform(X) >>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>> clf_lr.fit(X_std, y) >>> clf_lr.coef_ >> >> >> Thank you, >> Raga >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ichkoar at gmail.com Mon Aug 28 08:37:14 2017 From: ichkoar at gmail.com (Christos Aridas) Date: Mon, 28 Aug 2017 15:37:14 +0300 Subject: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0 In-Reply-To: References: Message-ID: Well done guys! Thanks a lot for this great release! I hope to be back soon. Best, Chris On Fri, Aug 25, 2017 at 3:14 AM, Guillaume Lemaître wrote: > We are excited to announce the new release of the scikit-learn-contrib > imbalanced-learn, already available through conda and pip (cf. the > installation page https://tinyurl.com/y92flbab for more info) > > Notable add-ons are: > > * Support of sparse matrices > * Support of multi-class resampling for all methods > * A new BalancedBaggingClassifier using random under-sampling chained with > the scikit-learn BaggingClassifier > * Creation of a didactic user guide > * New API of the ratio parameter to fit the needs of multi-class resampling > * Migration from nosetests to pytest > > You can check the full changelog at: > http://contrib.scikit-learn.org/imbalanced-learn/stable/ > whats_new.html#version-0-3 > > A big thank you to contributors to use, raise issues, and submit PRs to > imblearn. > -- > Guillaume Lemaitre > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 28 12:01:31 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Aug 2017 12:01:31 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: You can get the coefficients on the scaled data with pipe_lr.named_steps['clf'].coef_ though On 08/28/2017 12:08 AM, Raga Markely wrote: > No problem, thank you! 
> > Best, > Raga > > On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman > wrote: > > No, we do not have a way to get the coefficients with respect to > the input (pre-scaling) space. > > On 28 August 2017 at 13:20, Raga Markely > wrote: > > Hello, > > I am wondering if it's possible to get the weight coefficients > of logistic regression from a pipeline? > > For instance, I have the followings: > > clf_lr = LogisticRegression(penalty='l1', C=0.1) > pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', > clf_lr]]) > pipe_lr.fit(X, y) > > > Does pipe_lr have an attribute that I can call to get the > weight coefficient? > > Or do I have to get it from the classifier as follows? > > X_std = StandardScaler().fit_transform(X) > clf_lr = LogisticRegression(penalty='l1', C=0.1) > clf_lr.fit(X_std, y) > clf_lr.coef_ > > > Thank you, > Raga > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 28 13:16:19 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Aug 2017 13:16:19 -0400 Subject: [scikit-learn] scikit-learn-commits mailing list defunct? Message-ID: <465fa212-0fb6-cfe2-2281-416c672429ca@gmail.com> Hey all. Is it just me or is the scikit-learn-commits mailing list no longer working? Given that it's still on sourceforge, that seems somewhat likely. I find the mailing list helpful in case I can't keep track of the issue tracker (i.e. for the last 3 years?). It looks like it was set up here: https://github.com/scikit-learn/scikit-learn/settings/hooks/28838 I propose we transition this to either a google group or another python.org mailing list (they already gave us 2 ;). Cheers, Andy From raga.markely at gmail.com Mon Aug 28 14:12:05 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 28 Aug 2017 14:12:05 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: Thank you, Andreas. When I try > pipe_lr.named_steps['clf'].coef_ I get: > AttributeError: 'LogisticRegression' object has no attribute 'coef_' And when I try: > pipe_lr.named_steps['clf'] I get: > LogisticRegression(C=0.1, class_weight=None, dual=False, > fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', > n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, > verbose=0, warm_start=False) I wonder what I am missing? Thanks, Raga On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller wrote: > Can can get the coefficients on the scaled data with > pipeline_lr.named_steps_['clf'].coef_ > though > > > On 08/28/2017 12:08 AM, Raga Markely wrote: > > No problem, thank you! > > Best, > Raga > > On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman > wrote: > >> No, we do not have a way to get the coefficients with respect to the >> input (pre-scaling) space. >> >> On 28 August 2017 at 13:20, Raga Markely wrote: >> >>> Hello, >>> >>> I am wondering if it's possible to get the weight coefficients of >>> logistic regression from a pipeline? 
>>> >>> For instance, I have the followings: >>> >>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >>>> pipe_lr.fit(X, y) >>> >>> >>> Does pipe_lr have an attribute that I can call to get the weight >>> coefficient? >>> >>> Or do I have to get it from the classifier as follows? >>> >>>> X_std = StandardScaler().fit_transform(X) >>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>> clf_lr.fit(X_std, y) >>>> clf_lr.coef_ >>> >>> >>> Thank you, >>> Raga >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From susan_liu at brown.edu Mon Aug 28 14:17:04 2017 From: susan_liu at brown.edu (Liu, Susan) Date: Mon, 28 Aug 2017 14:17:04 -0400 Subject: [scikit-learn] remoe from list Message-ID: hi there, just wanted to ask if i could be removed from list? Thanks, Susan -------------- next part -------------- An HTML attachment was scrubbed... URL: From alekhka at gmail.com Mon Aug 28 14:19:57 2017 From: alekhka at gmail.com (Alekh Karkada Ashok) Date: Mon, 28 Aug 2017 23:49:57 +0530 Subject: [scikit-learn] remoe from list In-Reply-To: References: Message-ID: Hi Susan, You can visit https://mail.python.org/mailman/listinfo/scikit-learn and unsubscribe from the list there. Thanks, Alekh On Mon, Aug 28, 2017 at 11:47 PM, Liu, Susan wrote: > hi there, > > just wanted to ask if i could be removed from list? > > > Thanks, > Susan > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 28 14:55:08 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Aug 2017 14:55:08 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: Have you called "fit" on the pipeline? On 08/28/2017 02:12 PM, Raga Markely wrote: > Thank you, Andreas. > > When I try > > pipe_lr.named_steps['clf'].coef_ > > > I get: > > AttributeError: 'LogisticRegression' object has no attribute 'coef_' > > > And when I try: > > pipe_lr.named_steps['clf'] > > > I get: > > LogisticRegression(C=0.1, class_weight=None, dual=False, > fit_intercept=True, intercept_scaling=1, max_iter=100, > multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, > solver='liblinear', tol=0.0001, verbose=0, warm_start=False) > > > I wonder what I am missing? > > Thanks, > Raga > > > On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller > wrote: > > Can can get the coefficients on the scaled data with > pipeline_lr.named_steps_['clf'].coef_ > though > > > On 08/28/2017 12:08 AM, Raga Markely wrote: >> No problem, thank you! 
>> >> Best, >> Raga >> >> On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman >> > wrote: >> >> No, we do not have a way to get the coefficients with respect >> to the input (pre-scaling) space. >> >> On 28 August 2017 at 13:20, Raga Markely >> > wrote: >> >> Hello, >> >> I am wondering if it's possible to get the weight >> coefficients of logistic regression from a pipeline? >> >> For instance, I have the followings: >> >> clf_lr = LogisticRegression(penalty='l1', C=0.1) >> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', >> clf_lr]]) >> pipe_lr.fit(X, y) >> >> >> Does pipe_lr have an attribute that I can call to get the >> weight coefficient? >> >> Or do I have to get it from the classifier as follows? >> >> X_std = StandardScaler().fit_transform(X) >> clf_lr = LogisticRegression(penalty='l1', C=0.1) >> clf_lr.fit(X_std, y) >> clf_lr.coef_ >> >> >> Thank you, >> Raga >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Mon Aug 28 15:07:44 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 28 Aug 2017 15:07:44 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: Ah.. got it :D.. The pipeline was run in gridsearchcv.. It works now after calling fit.. Thanks! Raga On Mon, Aug 28, 2017 at 2:55 PM, Andreas Mueller wrote: > Have you called "fit" on the pipeline? > > > On 08/28/2017 02:12 PM, Raga Markely wrote: > > Thank you, Andreas. > > When I try > >> pipe_lr.named_steps['clf'].coef_ > > > I get: > >> AttributeError: 'LogisticRegression' object has no attribute 'coef_' > > > And when I try: > >> pipe_lr.named_steps['clf'] > > > I get: > >> LogisticRegression(C=0.1, class_weight=None, dual=False, >> fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', >> n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, >> verbose=0, warm_start=False) > > > I wonder what I am missing? > > Thanks, > Raga > > > On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller > wrote: > >> Can can get the coefficients on the scaled data with >> pipeline_lr.named_steps_['clf'].coef_ >> though >> >> >> On 08/28/2017 12:08 AM, Raga Markely wrote: >> >> No problem, thank you! >> >> Best, >> Raga >> >> On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman >> wrote: >> >>> No, we do not have a way to get the coefficients with respect to the >>> input (pre-scaling) space. >>> >>> On 28 August 2017 at 13:20, Raga Markely wrote: >>> >>>> Hello, >>>> >>>> I am wondering if it's possible to get the weight coefficients of >>>> logistic regression from a pipeline? 
>>>> >>>> For instance, I have the followings: >>>> >>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >>>>> pipe_lr.fit(X, y) >>>> >>>> >>>> Does pipe_lr have an attribute that I can call to get the weight >>>> coefficient? >>>> >>>> Or do I have to get it from the classifier as follows? >>>> >>>>> X_std = StandardScaler().fit_transform(X) >>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>> clf_lr.fit(X_std, y) >>>>> clf_lr.coef_ >>>> >>>> >>>> Thank you, >>>> Raga >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 28 15:20:27 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Aug 2017 15:20:27 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: you can also use grid.best_estimator_ (and then all the rest) On 08/28/2017 03:07 PM, Raga Markely wrote: > Ah.. got it :D.. > > The pipeline was run in gridsearchcv.. > > It works now after calling fit.. > > Thanks! > Raga > > On Mon, Aug 28, 2017 at 2:55 PM, Andreas Mueller > wrote: > > Have you called "fit" on the pipeline? > > > On 08/28/2017 02:12 PM, Raga Markely wrote: >> Thank you, Andreas. >> >> When I try >> >> pipe_lr.named_steps['clf'].coef_ >> >> >> I get: >> >> AttributeError: 'LogisticRegression' object has no attribute >> 'coef_' >> >> >> And when I try: >> >> pipe_lr.named_steps['clf'] >> >> >> I get: >> >> LogisticRegression(C=0.1, class_weight=None, dual=False, >> fit_intercept=True, intercept_scaling=1, max_iter=100, >> multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, >> solver='liblinear', tol=0.0001, verbose=0, warm_start=False) >> >> >> I wonder what I am missing? >> >> Thanks, >> Raga >> >> >> On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller >> > wrote: >> >> Can can get the coefficients on the scaled data with >> pipeline_lr.named_steps_['clf'].coef_ >> though >> >> >> On 08/28/2017 12:08 AM, Raga Markely wrote: >>> No problem, thank you! >>> >>> Best, >>> Raga >>> >>> On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman >>> > wrote: >>> >>> No, we do not have a way to get the coefficients with >>> respect to the input (pre-scaling) space. >>> >>> On 28 August 2017 at 13:20, Raga Markely >>> > >>> wrote: >>> >>> Hello, >>> >>> I am wondering if it's possible to get the weight >>> coefficients of logistic regression from a pipeline? 
>>> >>> For instance, I have the followings: >>> >>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>> pipe_lr = Pipeline([['sc', StandardScaler()], >>> ['clf', clf_lr]]) >>> pipe_lr.fit(X, y) >>> >>> >>> Does pipe_lr have an attribute that I can call to >>> get the weight coefficient? >>> >>> Or do I have to get it from the classifier as follows? >>> >>> X_std = StandardScaler().fit_transform(X) >>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>> clf_lr.fit(X_std, y) >>> clf_lr.coef_ >>> >>> >>> Thank you, >>> Raga >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Mon Aug 28 16:32:48 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 28 Aug 2017 16:32:48 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: Sounds good.. tried it and works.. thank you! On Mon, Aug 28, 2017 at 3:20 PM, Andreas Mueller wrote: > you can also use grid.best_estimator_ (and then all the rest) > > On 08/28/2017 03:07 PM, Raga Markely wrote: > > Ah.. got it :D.. > > The pipeline was run in gridsearchcv.. > > It works now after calling fit.. > > Thanks! > Raga > > On Mon, Aug 28, 2017 at 2:55 PM, Andreas Mueller wrote: > >> Have you called "fit" on the pipeline? >> >> >> On 08/28/2017 02:12 PM, Raga Markely wrote: >> >> Thank you, Andreas. >> >> When I try >> >>> pipe_lr.named_steps['clf'].coef_ >> >> >> I get: >> >>> AttributeError: 'LogisticRegression' object has no attribute 'coef_' >> >> >> And when I try: >> >>> pipe_lr.named_steps['clf'] >> >> >> I get: >> >>> LogisticRegression(C=0.1, class_weight=None, dual=False, >>> fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', >>> n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, >>> verbose=0, warm_start=False) >> >> >> I wonder what I am missing? >> >> Thanks, >> Raga >> >> >> On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller >> wrote: >> >>> Can can get the coefficients on the scaled data with >>> pipeline_lr.named_steps_['clf'].coef_ >>> though >>> >>> >>> On 08/28/2017 12:08 AM, Raga Markely wrote: >>> >>> No problem, thank you! 
>>> >>> Best, >>> Raga >>> >>> On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman >>> wrote: >>> >>>> No, we do not have a way to get the coefficients with respect to the >>>> input (pre-scaling) space. >>>> >>>> On 28 August 2017 at 13:20, Raga Markely >>>> wrote: >>>> >>>>> Hello, >>>>> >>>>> I am wondering if it's possible to get the weight coefficients of >>>>> logistic regression from a pipeline? >>>>> >>>>> For instance, I have the followings: >>>>> >>>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>>> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >>>>>> pipe_lr.fit(X, y) >>>>> >>>>> >>>>> Does pipe_lr have an attribute that I can call to get the weight >>>>> coefficient? >>>>> >>>>> Or do I have to get it from the classifier as follows? >>>>> >>>>>> X_std = StandardScaler().fit_transform(X) >>>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>>> clf_lr.fit(X_std, y) >>>>>> clf_lr.coef_ >>>>> >>>>> >>>>> Thank you, >>>>> Raga >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Aug 28 16:42:27 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 28 Aug 2017 22:42:27 +0200 Subject: [scikit-learn] scikit-learn-commits mailing list defunct? In-Reply-To: <465fa212-0fb6-cfe2-2281-416c672429ca@gmail.com> References: <465fa212-0fb6-cfe2-2281-416c672429ca@gmail.com> Message-ID: +1 ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Aug 28 16:42:51 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 28 Aug 2017 22:42:51 +0200 Subject: [scikit-learn] scikit-learn-commits mailing list defunct? In-Reply-To: References: <465fa212-0fb6-cfe2-2281-416c672429ca@gmail.com> Message-ID: +1 for python.org if they accept this kind of mailing lists. ? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From joel.nothman at gmail.com Mon Aug 28 20:23:03 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Aug 2017 10:23:03 +1000 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: Sorry if I misunderstood your question. On 29 August 2017 at 06:32, Raga Markely wrote: > Sounds good.. tried it and works.. thank you! > > On Mon, Aug 28, 2017 at 3:20 PM, Andreas Mueller wrote: > >> you can also use grid.best_estimator_ (and then all the rest) >> >> On 08/28/2017 03:07 PM, Raga Markely wrote: >> >> Ah.. got it :D.. >> >> The pipeline was run in gridsearchcv.. >> >> It works now after calling fit.. >> >> Thanks! >> Raga >> >> On Mon, Aug 28, 2017 at 2:55 PM, Andreas Mueller >> wrote: >> >>> Have you called "fit" on the pipeline? >>> >>> >>> On 08/28/2017 02:12 PM, Raga Markely wrote: >>> >>> Thank you, Andreas. >>> >>> When I try >>> >>>> pipe_lr.named_steps['clf'].coef_ >>> >>> >>> I get: >>> >>>> AttributeError: 'LogisticRegression' object has no attribute 'coef_' >>> >>> >>> And when I try: >>> >>>> pipe_lr.named_steps['clf'] >>> >>> >>> I get: >>> >>>> LogisticRegression(C=0.1, class_weight=None, dual=False, >>>> fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', >>>> n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, >>>> verbose=0, warm_start=False) >>> >>> >>> I wonder what I am missing? >>> >>> Thanks, >>> Raga >>> >>> >>> On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller >>> wrote: >>> >>>> Can can get the coefficients on the scaled data with >>>> pipeline_lr.named_steps_['clf'].coef_ >>>> though >>>> >>>> >>>> On 08/28/2017 12:08 AM, Raga Markely wrote: >>>> >>>> No problem, thank you! >>>> >>>> Best, >>>> Raga >>>> >>>> On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman >>>> wrote: >>>> >>>>> No, we do not have a way to get the coefficients with respect to the >>>>> input (pre-scaling) space. >>>>> >>>>> On 28 August 2017 at 13:20, Raga Markely >>>>> wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> I am wondering if it's possible to get the weight coefficients of >>>>>> logistic regression from a pipeline? >>>>>> >>>>>> For instance, I have the followings: >>>>>> >>>>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>>>> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >>>>>>> pipe_lr.fit(X, y) >>>>>> >>>>>> >>>>>> Does pipe_lr have an attribute that I can call to get the weight >>>>>> coefficient? >>>>>> >>>>>> Or do I have to get it from the classifier as follows? 
>>>>>> >>>>>>> X_std = StandardScaler().fit_transform(X) >>>>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>>>> clf_lr.fit(X_std, y) >>>>>>> clf_lr.coef_ >>>>>> >>>>>> >>>>>> Thank you, >>>>>> Raga >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Mon Aug 28 21:06:49 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 28 Aug 2017 21:06:49 -0400 Subject: [scikit-learn] Getting weight coefficient of logistic regression from a pipeline In-Reply-To: References: Message-ID: No worries.. ur answer is helpful for me too.. I was actually exploring different ways to get the coeff, what i can and can't get :).. Thanks! On Aug 28, 2017 8:24 PM, "Joel Nothman" wrote: > Sorry if I misunderstood your question. > > On 29 August 2017 at 06:32, Raga Markely wrote: > >> Sounds good.. tried it and works.. thank you! >> >> On Mon, Aug 28, 2017 at 3:20 PM, Andreas Mueller >> wrote: >> >>> you can also use grid.best_estimator_ (and then all the rest) >>> >>> On 08/28/2017 03:07 PM, Raga Markely wrote: >>> >>> Ah.. got it :D.. >>> >>> The pipeline was run in gridsearchcv.. >>> >>> It works now after calling fit.. >>> >>> Thanks! >>> Raga >>> >>> On Mon, Aug 28, 2017 at 2:55 PM, Andreas Mueller >>> wrote: >>> >>>> Have you called "fit" on the pipeline? >>>> >>>> >>>> On 08/28/2017 02:12 PM, Raga Markely wrote: >>>> >>>> Thank you, Andreas. 
>>>> >>>> When I try >>>> >>>>> pipe_lr.named_steps['clf'].coef_ >>>> >>>> >>>> I get: >>>> >>>>> AttributeError: 'LogisticRegression' object has no attribute 'coef_' >>>> >>>> >>>> And when I try: >>>> >>>>> pipe_lr.named_steps['clf'] >>>> >>>> >>>> I get: >>>> >>>>> LogisticRegression(C=0.1, class_weight=None, dual=False, >>>>> fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', >>>>> n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, >>>>> verbose=0, warm_start=False) >>>> >>>> >>>> I wonder what I am missing? >>>> >>>> Thanks, >>>> Raga >>>> >>>> >>>> On Mon, Aug 28, 2017 at 12:01 PM, Andreas Mueller >>>> wrote: >>>> >>>>> Can can get the coefficients on the scaled data with >>>>> pipeline_lr.named_steps_['clf'].coef_ >>>>> though >>>>> >>>>> >>>>> On 08/28/2017 12:08 AM, Raga Markely wrote: >>>>> >>>>> No problem, thank you! >>>>> >>>>> Best, >>>>> Raga >>>>> >>>>> On Mon, Aug 28, 2017 at 12:01 AM, Joel Nothman >>>> > wrote: >>>>> >>>>>> No, we do not have a way to get the coefficients with respect to the >>>>>> input (pre-scaling) space. >>>>>> >>>>>> On 28 August 2017 at 13:20, Raga Markely >>>>>> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> I am wondering if it's possible to get the weight coefficients of >>>>>>> logistic regression from a pipeline? >>>>>>> >>>>>>> For instance, I have the followings: >>>>>>> >>>>>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>>>>> pipe_lr = Pipeline([['sc', StandardScaler()], ['clf', clf_lr]]) >>>>>>>> pipe_lr.fit(X, y) >>>>>>> >>>>>>> >>>>>>> Does pipe_lr have an attribute that I can call to get the weight >>>>>>> coefficient? >>>>>>> >>>>>>> Or do I have to get it from the classifier as follows? >>>>>>> >>>>>>>> X_std = StandardScaler().fit_transform(X) >>>>>>>> clf_lr = LogisticRegression(penalty='l1', C=0.1) >>>>>>>> clf_lr.fit(X_std, y) >>>>>>>> clf_lr.coef_ >>>>>>> >>>>>>> >>>>>>> Thank you, >>>>>>> Raga >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn 
mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jonny.Evans at soton.ac.uk Wed Aug 30 05:50:47 2017 From: Jonny.Evans at soton.ac.uk (Evans J.R.A.) Date: Wed, 30 Aug 2017 09:50:47 +0000 Subject: [scikit-learn] FW: Random Forest Regressor criterion Message-ID: Hi there, I would like to fully understand how the Random Forest Regressor chooses how to split the data at each node. I understand that each tree considers a boostrap sample of the training data, and on each split a random subset of features (using max_features) are considered. But among these features, how does the algorithm work out which is the best split to make? I am using the default criterion 'mse', but don't understand the given explanation "equal to variance reduction as feature selection criterion". Does this mean that for each possible split that could be made, the sum of variances of data in the child nodes is calculated, then the algorithm would use the split with the least sum of variances? Kind regards, Jonny Evans Doctoral Researcher Transportation Research Group Faculty of Engineering and the Environment University of Southampton Email: Jonny.Evans at soton.ac.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Aug 31 01:47:44 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 31 Aug 2017 01:47:44 -0400 Subject: [scikit-learn] Random Forest Regressor criterion In-Reply-To: References: Message-ID: Hi, regarding MSE minimization vs variance reduction; it's been a few years but I remember that we had a discussion about that, where Gilles Louppe explained that those two are identical when I was confused about the wikipedia equation at https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction (I didn't read carefully and somehow thought that x_i etc was referring to feature columns instead of x being the target variable :P). A better resource: I think Gilles also had a page about that in his thesis but I currently can't find the page. The thesis should be accessible from https://arxiv.org/abs/1407.7502 though, and I would recommend taking a look at "3.6.3 Finding the best binary split" and page 108+ on how it's implemented (if this is still up to date with the current implementation!?). This would probably address all your questions :). Best, Sebastian > On Aug 30, 2017, at 5:50 AM, Evans J.R.A. wrote: > > Hi there, > > I would like to fully understand how the Random Forest Regressor chooses how to split the data at each node. > > I understand that each tree considers a boostrap sample of the training data, and on each split a random subset of features (using max_features) are considered. But among these features, how does the algorithm work out which is the best split to make? I am using the default criterion ?mse?, but don?t understand the given explanation ?equal to variance reduction as feature selection criterion?. Does this mean that for each possible split that could be made, the sum of variances of data in the child nodes is calculated, then the algorithm would use the split with the least sum of variances? 
> > Kind regards, > > Jonny Evans > Doctoral Researcher > Transportation Research Group > Faculty of Engineering and the Environment > University of Southampton > Email: Jonny.Evans at soton.ac.uk > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn