From kevin at dataschool.io Wed Jul 1 15:40:15 2020 From: kevin at dataschool.io (Kevin Markham) Date: Wed, 1 Jul 2020 15:40:15 -0400 Subject: [scikit-learn] Best way to include SimpleImputer before CountVectorizer in a Pipeline? Message-ID: Hello! I have a DataFrame with a column of text, and I would like to vectorize the text using CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value (for any missing values) before vectorizing. My initial thought was to create a Pipeline of SimpleImputer (with strategy='constant') and CountVectorizer. However, SimpleImputer outputs a 2D array and CountVectorizer requires 1D input. The only solution I have found is to insert a transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer. (You can find my code at the bottom of this message.) My question: Is there a more elegant solution to this problem than what I'm currently doing? Notes: - I realize that the missing values could be filled in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied via Pipeline to out-of-sample data. - I recall seeing a GitHub issue in which Andy proposed that CountVectorizer should allow 2D input as long as the second dimension is 1 (in other words: a single column of data). This modification to CountVectorizer would be a great long-term solution to my problem. However, I'm looking for a solution that would work in the current version of scikit-learn. Thank you so much for any feedback or ideas! 
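[Editor's note: one possible variant, not from the original message — flattening with np.ravel inside a FunctionTransformer plays the same role as the reshape step, and the fitted pipeline then applies identical preprocessing to out-of-sample data. The fill_value='missing' string is an assumption for illustration.]

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    FunctionTransformer(np.ravel),   # flatten (n_samples, 1) -> (n_samples,)
    CountVectorizer(),
)

X = pipe.fit_transform(df[['text']])

# new, out-of-sample data goes through the identical preprocessing
new_df = pd.DataFrame({'text': ['abc', np.nan]})
X_new = pipe.transform(new_df[['text']])
```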
Kevin

== START OF CODE EXAMPLE ==

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

imp = SimpleImputer(strategy='constant')
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape': -1})
vect = CountVectorizer()

pipe = make_pipeline(imp, one_dim, vect)
pipe.fit_transform(df[['text']]).toarray()

== END OF CODE EXAMPLE ==

-- 
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool

From neetu162 at gmail.com Sun Jul 5 12:05:44 2020
From: neetu162 at gmail.com (neetu agrawal)
Date: Sun, 5 Jul 2020 21:35:44 +0530
Subject: [scikit-learn] cross_validate and ValueError: The first argument to `Layer.call` must always be passed
Message-ID: 

I am trying to use cross_validate. I had an initial hiccup due to picklability, but was able to get past that. However, I am still not able to get cross_validate to work.

Git link: https://github.com/Neetu162/DeepLearningResearch/blob/76675a79a4922b8bd0d722ab2e4cad448a8d8c76/Demo/classify_demo.py#L106

Error:

Average recall value is: 0.9453125
creating the loaded model
calling the cross_validate method
/home/osboxes/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
Details:

Traceback (most recent call last):
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/wrappers/scikit_learn.py", line 223, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/wrappers/scikit_learn.py", line 155, in fit
    **self.filter_sk_params(self.build_fn.__call__))
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 800, in __call__
    'The first argument to `Layer.call` must always be passed.')
ValueError: The first argument to `Layer.call` must always be passed.
  FitFailedWarning)

-- 
Thanks & Regards,
Neetu

From marmochiaskl at gmail.com Tue Jul 21 06:14:16 2020
From: marmochiaskl at gmail.com (Chiara Marmo)
Date: Tue, 21 Jul 2020 12:14:16 +0200
Subject: [scikit-learn] scikit-learn monthly meeting July 27th
Message-ID: 

Dear list,

The next scikit-learn monthly meeting will take place on Monday July 27th at 12PM UTC:

https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=7&day=27&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195

While these meetings are mainly for core devs to discuss current topics, we are also happy to welcome non-core devs and other project maintainers.
Feel free to join using the following link: https://meet.google.com/xhq-yoga-rtf

If you plan to attend and would like to discuss something specific about your contribution, please add your name (or GitHub username) in the "Contributors" section of the public pad: https://hackmd.io/RNVcxGUPRvyEBidf7ZOkag

*@core devs, please make sure to update your notes before the weekend.*

Best,
Chiara

From matt.gregory at oregonstate.edu Fri Jul 31 12:02:57 2020
From: matt.gregory at oregonstate.edu (Gregory, Matthew)
Date: Fri, 31 Jul 2020 16:02:57 +0000
Subject: [scikit-learn] custom estimator with more than two arguments to fit()
Message-ID: 

Hi all,

I'm fairly new to scikit-learn, but I have been using a predictive model for a while now that would benefit from scikit-learn's estimator API. However, I could use some advice on how best to implement this.

Briefly, the model is a combination of dimension reduction and nearest neighbors, but the dimension-reduction step (canonical correspondence analysis - CCA) relies on two matrices to create the synthetic feature scores for the candidates in the nearest-neighbor step. The two matrices are a "species" matrix (spp) and an "environmental" matrix (env), which are used to create orthogonal CCA axes that are linear combinations of the environmental features.

In reading through the documentation on creating new estimators, it seems that every estimator should provide a fit(X, y) method. Somehow I need my X parameter to be both the spp and env matrices together.
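[Editor's note: one way to satisfy the fit(X, y) signature is to pack both matrices into a single object that fit() unpacks — a minimal sketch; the DataHandler and TwoMatrixEstimator names, and the trivial fit body, are hypothetical illustrations, not an existing scikit-learn API.]

```python
import numpy as np
from sklearn.base import BaseEstimator


class DataHandler:
    """Hypothetical container packing the two matrices into a single X."""
    def __init__(self, spp, env):
        self.spp = np.asarray(spp)
        self.env = np.asarray(env)

    def __len__(self):
        # number of samples (rows) shared by both matrices
        return self.spp.shape[0]


class TwoMatrixEstimator(BaseEstimator):
    """Hypothetical estimator whose fit() unpacks the container."""
    def fit(self, X, y=None):
        spp, env = X.spp, X.env
        # ... run the CCA + nearest-neighbor fitting on (spp, env) here ...
        self.n_samples_ = spp.shape[0]
        return self


rng = np.random.RandomState(0)
est = TwoMatrixEstimator().fit(DataHandler(rng.rand(50, 3), rng.rand(50, 5)))
```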
I got a lot of good inspiration from this post on Stack Overflow: https://stackoverflow.com/questions/45966500/use-sklearn-gridsearchcv-on-custom-class-whose-fit-method-takes-3-arguments and can mostly understand how the OP implemented this, basically by creating a DataHandler class that packs together the two matrices, such that the call to fit would look like: estimator.fit(DataHandler(spp, env), y) I'm wondering if this is the best way to handle the design or if I'm not fully understanding how I could use a Pipeline to accomplish the same goal. Thanks for any guidance - boilerplate sample code would be most appreciated! matt From niourf at gmail.com Fri Jul 31 12:10:41 2020 From: niourf at gmail.com (Nicolas Hug) Date: Fri, 31 Jul 2020 12:10:41 -0400 Subject: [scikit-learn] custom estimator with more than two arguments to fit() In-Reply-To: References: Message-ID: <137d525a-6fa8-876e-3491-a33ddacd1e2e@gmail.com> Hi Matt, We do have CCA and other PLS-related transformers / regressors in scikit-learn. They are able to do dimensionality reduction on both X and Y (which I believe correspond to spp and env), so you might want to have a look at these. However, they're not fully compatible with the whole ecosystem unfortunately: for example our Pipeline objects assume that only X can be transformed, not Y. Nicolas On 7/31/20 12:02 PM, Gregory, Matthew wrote: > Hi all, > > I'm fairly new to scikit-learn, but have been using a predictive model for a while now that would benefit from scikit-learn's estimator API. However, I could use some advice on how best to implement this. > > Briefly, the model is a combination of dimension reduction and nearest neighbors, but the dimension reduction step (canonical correspondence analysis - CCA) relies on two matrices to create the synthetic feature scores for the candidates in the nearest neighbor step. 
The two matrices are a "species" matrix (spp) and an "environmental" matrix (env) which are used to create orthogonal CCA axes that are linear combinations of the environmental features. > > In reading through the documentation on creating new estimators, it seems that every estimator should provide a fit(X, y) method. Somehow I need my X parameter to be both the spp and env matrices together. I got a lot of good inspiration from this post on Stack Overflow: > > https://stackoverflow.com/questions/45966500/use-sklearn-gridsearchcv-on-custom-class-whose-fit-method-takes-3-arguments > > and can mostly understand how the OP implemented this, basically by creating a DataHandler class that packs together the two matrices, such that the call to fit would look like: > > estimator.fit(DataHandler(spp, env), y) > > I'm wondering if this is the best way to handle the design or if I'm not fully understanding how I could use a Pipeline to accomplish the same goal. Thanks for any guidance - boilerplate sample code would be most appreciated! > > matt > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From matt.gregory at oregonstate.edu Fri Jul 31 12:26:43 2020 From: matt.gregory at oregonstate.edu (Gregory, Matthew) Date: Fri, 31 Jul 2020 16:26:43 +0000 Subject: [scikit-learn] custom estimator with more than two arguments to fit() In-Reply-To: <137d525a-6fa8-876e-3491-a33ddacd1e2e@gmail.com> References: <137d525a-6fa8-876e-3491-a33ddacd1e2e@gmail.com> Message-ID: Hi Nicolas, Nicolas Hug wrote: > We do have CCA and other PLS-related transformers / regressors in > scikit-learn. They are able to do dimensionality reduction on both > X and Y (which I believe correspond to spp and env), so you might > want to have a look at these. 
However, they're not fully > compatible with the whole ecosystem unfortunately: for example our > Pipeline objects assume that only X can be transformed, not Y. Just to clarify, I'm only seeing canonical *correlation* analysis and not canonical *correspondence* analysis (ter Braak) in scikit-learn? https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html But your point is taken - I can use this for inspiration because it has both X and Y matrices. But if I'm understanding correctly, there is no way to couple this with a further step of NearestNeighbors into a pipeline? I will only need the transformed scores coming out of CCA to feed into the NearestNeighbors step. Sorry if I'm not understanding this correctly. matt