[scikit-learn] scikit-learn Digest, Vol 43, Issue 10

Sat Oct 5 14:55:33 EDT 2019

 1. Re: Can Scikit-learn decision tree (CART) have both
      continuous and categorical features? (C W)

My follow-up question is whether the outputs of a regression model and a
classification model can be fed together as inputs to one model that
produces a single combined output.

On Sat, Oct 5, 2019, 11:50 AM, <scikit-learn-request at python.org> wrote:

>
> Message: 1
> Date: Sat, 5 Oct 2019 14:50:09 -0400
> From: C W <tmrsg11 at gmail.com>
> To: Scikit-learn mailing list <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have
>         both continuous and categorical features?
>
> Thanks, great material! I got pydotplus with graphviz to work.
>
> Using the code on the sklearn website [1],
> tree.plot_tree(clf.fit(iris.data, iris.target)) gives an error:
> AttributeError: module 'sklearn.tree' has no attribute 'plot_tree'
>
> Both my colleague and I got the same error message. Per this post
> https://github.com/Microsoft/LightGBM/issues/1844, a PyPI update is
> needed.
>
> [1] sklearn link:
> https://scikit-learn.org/stable/modules/tree.html#classification
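
A note on the AttributeError above: as far as I know, tree.plot_tree was
only added in scikit-learn 0.21, so an older installed version would give
exactly this message. Below is a minimal sketch for checking the installed
version. The names has_plot_tree, major and minor are just illustrative.

```python
import re

import sklearn
from sklearn import tree

# Pull "major.minor" out of the version string, e.g. "0.20.3" -> (0, 20).
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))

# Assumption: plot_tree is available from scikit-learn 0.21 onwards.
has_plot_tree = hasattr(tree, "plot_tree")

print("scikit-learn version:", sklearn.__version__)
print("tree.plot_tree available:", has_plot_tree)
```

If this prints False, upgrading scikit-learn (for example with
pip install -U scikit-learn) should make the example from the docs work.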
>
>
> On Fri, Oct 4, 2019 at 11:52 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>
> > The docs show a way to plot the tree without writing it to a png file,
> > using tree.plot_tree:
> > https://scikit-learn.org/stable/modules/tree.html#classification
> >
> > I don't remember exactly why, but I think I had problems with that in the
> > past (I think it didn't look so nice visually), which is why I still stick
> > to graphviz. For my use cases, it's not much hassle. It used to be a bit of
> > a hassle to get GraphViz working, but now you can do
> >
> > conda install pydotplus
> > conda install graphviz
> >
> > Coincidentally, I just made an example for a lecture I was teaching on Tue:
> > https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
> >
> > Best,
> > Sebastian
> >
> >
> > > On Oct 4, 2019, at 10:09 PM, C W <tmrsg11 at gmail.com> wrote:
> > >
> > > On a separate note, what do you use for plotting?
> > >
> > > > I found graphviz, but you have to first save it as a png on your
> > > > computer. That's a lot of work for just one plot. Is there something
> > > > like matplotlib?
> > >
> > > Thanks!
> > >
> > > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> > > > Yeah, think of it more as a computational workaround for achieving the
> > > > same thing more efficiently (although it looks inelegant/weird).
> > > > Something like that wouldn't be mentioned in textbooks.
> > >
> > > Best,
> > > Sebastian
> > >
> > > > On Oct 4, 2019, at 6:33 PM, C W <tmrsg11 at gmail.com> wrote:
> > > >
> > > > Thanks Sebastian, I think I get it.
> > > >
> > > > > I have just never seen it this way. It is quite different from what
> > > > > I'm used to in Elements of Statistical Learning.
> > > >
> > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> > > > > Not sure if there's a website for that. In any case, to explain this
> > > > > differently: as discussed earlier, sklearn assumes continuous
> > > > > features for decision trees. So, it will use a binary threshold for
> > > > > splitting along a feature. In other words, it cannot do something
> > > > > like
> > > >
> > > > if x == 1 then right child node
> > > > else left child node
> > > >
> > > > Instead, what it does is
> > > >
> > > > if x >= 0.5 then right child node
> > > > else left child node
> > > >
> > > > > These are basically equivalent, as you can see when you plug in the
> > > > > values 0 and 1 for x.
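
To make the equivalence above concrete, here is a minimal sketch in plain
Python. The function names equality_rule and threshold_rule are made up for
illustration and are not part of scikit-learn.

```python
def equality_rule(x):
    # The split a "categorical" tree would write: x == 1 goes right.
    return "right" if x == 1 else "left"


def threshold_rule(x):
    # The numerical split from the message above: x >= 0.5 goes right.
    return "right" if x >= 0.5 else "left"


# A one-hot encoded column only ever contains the values 0 and 1.
values = [0, 1]
agree = all(equality_rule(x) == threshold_rule(x) for x in values)
print(agree)  # True
```

The two rules only differ for values strictly between 0 and 1, which a
one-hot column never contains.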
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > > > On Oct 4, 2019, at 5:34 PM, C W <tmrsg11 at gmail.com> wrote:
> > > > >
> > > > > I don't understand your answer.
> > > > >
> > > > > > Why, after one-hot encoding, does it still output splits at greater
> > > > > > than or less than 0.5? Does the sklearn website have a working
> > > > > > example with categorical input?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> > > > > > Like Nicolas said, the 0.5 is just a workaround, but it will do the
> > > > > > right thing on the one-hot encoded variables here. You will find
> > > > > > that the threshold is always at 0.5 for these variables. I.e., it
> > > > > > will use the following conversion:
> > > > >
> > > > > treat as car_Audi=1 if car_Audi >= 0.5
> > > > > treat as car_Audi=0 if car_Audi < 0.5
> > > > >
> > > > > or, it may be
> > > > >
> > > > > treat as car_Audi=1 if car_Audi > 0.5
> > > > > treat as car_Audi=0 if car_Audi <= 0.5
> > > > >
> > > > > > (Forgot which one sklearn is using, but either way, it will be fine.)
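
The 0.5 threshold can be checked directly on a fitted tree. This is a
sketch on toy data I made up by repeating the three cars from the thread.
It reads the thresholds from the fitted tree's tree_ attribute. As far as I
know, scikit-learn sends samples with feature <= threshold to the left
child, which would correspond to the second variant above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One-hot columns in the order car_Audi, car_BMW, car_Toyota (toy data).
X = np.array([
    [0, 1, 0],  # BMW
    [0, 0, 1],  # Toyota
    [1, 0, 0],  # Audi
    [0, 1, 0],  # BMW
    [0, 0, 1],  # Toyota
    [1, 0, 0],  # Audi
])
y = np.array([10000.0, 9000.0, 12000.0, 10000.0, 9000.0, 12000.0])

reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# Leaves carry a negative feature index, so feature >= 0 selects split nodes.
is_split = reg.tree_.feature >= 0
split_thresholds = reg.tree_.threshold[is_split]
print(split_thresholds)

# The tree still recovers the per-car value from the one-hot columns.
audi_prediction = reg.predict(np.array([[1, 0, 0]]))[0]
print(audi_prediction)
```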
> > > > >
> > > > > Best,
> > > > > Sebastian
> > > > >
> > > > >
> > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <niourf at gmail.com> wrote:
> > > > >>
> > > > >>
> > > > > >>> But, decision tree is still mistaking one-hot-encoding as
> > > > > >>> numerical input and split at 0.5. This is not right. Perhaps, I'm
> > > > > >>> doing something wrong?
> > > > > >>
> > > > > >> You're not doing anything wrong, and neither is the tree. Trees
> > > > > >> don't support categorical variables in sklearn, so everything is
> > > > > >> treated as numerical.
> > > > > >>
> > > > > >> This is why we do one-hot-encoding: so that a set of numerical
> > > > > >> (one-hot encoded) features can be treated as if they were just one
> > > > > >> categorical feature.
> > > > >>
> > > > >>
> > > > >>
> > > > >> Nicolas
> > > > >>
> > > > >> On 10/4/19 2:01 PM, C W wrote:
> > > > > >>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So,
> > > > > >>> typo on my part.
> > > > > >>>
> > > > > >>> Looks like I did one-hot-encoding correctly. My new variable names
> > > > > >>> are: car_Audi, car_BMW, etc.
> > > > > >>>
> > > > > >>> But, decision tree is still mistaking one-hot-encoding as
> > > > > >>> numerical input and split at 0.5. This is not right. Perhaps, I'm
> > > > > >>> doing something wrong?
> > > > > >>>
> > > > > >>> Is there a good toy example on the sklearn website? I only see
> > > > > >>> this:
> > > > > >>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
> > > > >>>
> > > > >>> Thanks!
> > > > >>>
> > > > >>>
> > > > >>>
> > > > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> > > > > >>> Hi,
> > > > > >>>
> > > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0,
> > > > > >>>> Toyota=1, Audi=2) as numerical values, not category. The tree
> > > > > >>>> splits at 0.5 and 1.5
> > > > > >>>
> > > > > >>> That's not a one-hot encoding then.
> > > > >>>
> > > > >>> For an Audi datapoint, it should be
> > > > >>>
> > > > >>> BMW=0
> > > > >>> Toyota=0
> > > > >>> Audi=1
> > > > >>>
> > > > >>> for BMW
> > > > >>>
> > > > >>> BMW=1
> > > > >>> Toyota=0
> > > > >>> Audi=0
> > > > >>>
> > > > >>> and for Toyota
> > > > >>>
> > > > >>> BMW=0
> > > > >>> Toyota=1
> > > > >>> Audi=0
> > > > >>>
> > > > > >>> The split threshold should then be at 0.5 for any of these
> > > > > >>> features.
> > > > > >>>
> > > > > >>> Based on your email, I think you were assuming that the DT does
> > > > > >>> the one-hot encoding internally, which it doesn't. In practice, it
> > > > > >>> is hard to guess what is a nominal and what is an ordinal variable,
> > > > > >>> so you have to do the one-hot encoding before you give the data to
> > > > > >>> the decision tree.
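
A small sketch of what such a one-hot encoding looks like with
scikit-learn's OneHotEncoder, using the three car brands from the thread.
The variable names are illustrative. The encoder sorts the categories, so
the columns come out in the order Audi, BMW, Toyota.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with the car brand of each row.
cars = np.array([["BMW"], ["Toyota"], ["Audi"]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
onehot = enc.fit_transform(cars).toarray()

print(enc.categories_[0])  # column order of the encoded features
print(onehot)              # one row per car, exactly one 1 in each row
```

Each row has a single 1 in the column of its own brand, which is different
from the integer coding BMW=0, Toyota=1, Audi=2 quoted above.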
> > > > >>>
> > > > >>> Best,
> > > > >>> Sebastian
> > > > >>>
> > > > >>>> On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
> > > > >>>>
> > > > > >>>> I'm getting some funny results. I am doing a regression decision
> > > > > >>>> tree, the response variables are assigned to levels.
> > > > > >>>>
> > > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0,
> > > > > >>>> Toyota=1, Audi=2) as numerical values, not category.
> > > > > >>>>
> > > > > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding
> > > > > >>>> wrong? How does sklearn know internally that 0 vs. 1 is
> > > > > >>>> categorical, not numerical?
> > > > > >>>>
> > > > > >>>> In R, for instance, you do as.factor(), which explicitly states
> > > > > >>>> the data type.
> > > > >>>>
> > > > >>>> Thank you!
> > > > >>>>
> > > > >>>>
> > > > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3kcit at gmail.com> wrote:
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
> > > > >>>>> Thanks, Guillaume.
> > > > > >>>>> Column transformer looks pretty neat. I've also heard, though,
> > > > > >>>>> that this pipeline can be tedious to set up? Specifying what you
> > > > > >>>>> want for every feature is a pain.
> > > > > >>>>>
> > > > > >>>>> It would be interesting for us to know which part of the pipeline
> > > > > >>>>> is tedious to set up, to see if we can improve something there.
> > > > > >>>>> Do you mean that you would like to automatically detect the type
> > > > > >>>>> of each feature (categorical/numerical) and apply a default
> > > > > >>>>> encoder/scaling, as discussed here:
> > > > > >>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
> > > > > >>>>>
> > > > > >>>>> IMO, from a user perspective, it would be cleaner in some cases,
> > > > > >>>>> at the cost of blindly applying a black box, which might be
> > > > > >>>>> dangerous.
> > > > > >>>> Also see
> > > > > >>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> > > > > >>>> which basically does that.
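
For reference, a minimal sketch of the manual column transformer setup
discussed above, applied to the example table from the first message in the
thread. The choice of encoder, the step name "onehot" and the variable
names are my own assumptions, not something prescribed in the thread.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# The three example rows from the original question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Income": [10000, 9000, 12000],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
})
X = df.drop(columns="Income")
y = df["Income"]

# One-hot encode the categorical columns and pass Age through unchanged.
categorical = ["Gender", "Car", "Attendance"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
pred = model.predict(X)
print(pred)
```

The tedious part is the explicit list of categorical column names. With
only three training rows, the tree simply memorises the incomes, so this
only shows the wiring and says nothing about model quality.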
> > > > >>>>
> > > > >>>>
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> Javier,
> > > > > >>>>> Actually, you guessed right. My real data has only one numerical
> > > > > >>>>> variable, and looks more like this:
> > > > > >>>>>
> > > > > >>>>> Gender  Date       Income  Car     Attendance
> > > > > >>>>> Male    2019/3/01  10000   BMW     Yes
> > > > > >>>>> Female  2019/5/02   9000   Toyota  No
> > > > > >>>>> Male    2019/7/15  12000   Audi    Yes
> > > > > >>>>>
> > > > > >>>>> I am predicting income using all other categorical variables.
> > > > > >>>>> Maybe it is catboost!
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>>
> > > > >>>>> M
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez <jlopez at ende.cc>
> > wrote:
> > > > >>>>> If you have datasets with many categorical features, and
> perhaps
> > many categories, the tools in sklearn are quite limited,
> > > > >>>>> but there are alternative implementations of boosted trees that
> > are designed with categorical features in mind. Take a look
> > > > >>>>> at catboost [1], which has an sklearn-compatible API.
> > > > >>>>>
> > > > >>>>> J
> > > > >>>>>
> > > > >>>>> [1] https://catboost.ai/
> > > > >>>>>
> > > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
> > > > >>>>> Hello all,
> > > > > >>>>> I'm very confused. Can the decision tree module handle both
> > > > > >>>>> continuous and categorical features in the dataset? In this
> > > > > >>>>> case, it's just CART (Classification and Regression Trees).
> > > > > >>>>>
> > > > > >>>>> For example,
> > > > > >>>>> Gender  Age  Income  Car     Attendance
> > > > > >>>>> Male    30   10000   BMW     Yes
> > > > > >>>>> Female  35    9000   Toyota  No
> > > > > >>>>> Male    50   12000   Audi    Yes
> > > > > >>>>>
> > > > > >>>>> According to the documentation
> > > > > >>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
> > > > > >>>>> it cannot!
> > > > > >>>>>
> > > > > >>>>> It says: "scikit-learn implementation does not support
> > > > > >>>>> categorical variables for now".
> > > > > >>>>>
> > > > > >>>>> Is this true? If not, can someone point me to an example? If
> > > > > >>>>> yes, what do people do?
> > > > >>>>>
> > > > >>>>> Thank you very much!
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> _______________________________________________
> > > > >>>>> scikit-learn mailing list
> > > > >>>>> scikit-learn at python.org
> > > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >>>>> --
> > > > >>>>> Guillaume Lemaitre
> > > > >>>>> INRIA Saclay - Parietal team
> > > > >>>>> Center for Data Science Paris-Saclay
> > > > >>>>> https://glemaitre.github.io/
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> _______________________________________________
> > > > >>>>> scikit-learn mailing list
> > > > >>>>>
> > > > >>>>> scikit-learn at python.org
> > > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >>>>
> > > > >>>> _______________________________________________
> > > > >>>> scikit-learn mailing list
> > > > >>>> scikit-learn at python.org
> > > > >>>> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >>>> _______________________________________________
> > > > >>>> scikit-learn mailing list
> > > > >>>> scikit-learn at python.org
> > > > >>>> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >>>
> > > > >>> _______________________________________________
> > > > >>> scikit-learn mailing list
> > > > >>> scikit-learn at python.org
> > > > >>> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >>>
> > > > >>>
> > > > >>> _______________________________________________
> > > > >>> scikit-learn mailing list
> > > > >>>
> > > > >>> scikit-learn at python.org
> > > > >>> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >> _______________________________________________
> > > > >> scikit-learn mailing list
> > > > >> scikit-learn at python.org
> > > > >> https://mail.python.org/mailman/listinfo/scikit-learn
> > > > >
> > > > > _______________________________________________
> > > > > scikit-learn mailing list
> > > > > scikit-learn at python.org
> > > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > > > _______________________________________________
> > > > > scikit-learn mailing list
> > > > > scikit-learn at python.org
> > > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > >
> > > > _______________________________________________
> > > > scikit-learn mailing list
> > > > scikit-learn at python.org
> > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > > _______________________________________________
> > > > scikit-learn mailing list
> > > > scikit-learn at python.org
> > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <
> http://mail.python.org/pipermail/scikit-learn/attachments/20191005/7234be32/attachment.html
> >
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 10
> ********************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20191005/14272924/attachment-0001.html>