[scikit-learn] Help With Text Classification

pybokeh pybokeh at gmail.com
Wed Aug 2 22:01:36 EDT 2017


Hello,
I am studying this example from scikit-learn's site:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

The problem I need to solve is very similar to this example, except that I
have one additional feature column (part #), which is categorical and of
type string. My label (target) takes just two values: 0 or 1.

I transform that additional feature column with a LabelEncoder and then
encode the result with the OneHotEncoder.

Then I concatenate that one-hot encoded column (part #) with the
text/document feature column (complaint), to which I have applied the
CountVectorizer and TfidfTransformer transformations.

Then I chose the MultinomialNB model and fit it on my concatenated
training data.
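
A minimal sketch of the training steps described above, for reference. The
toy data and all variable names are hypothetical and are not taken from the
notebook; scipy.sparse.hstack is assumed as the way the two sparse feature
blocks are concatenated.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data standing in for the complaint / part # columns.
complaints = [
    "engine makes noise",
    "brake pedal soft",
    "engine stalls",
    "brake squeal noise",
]
part_numbers = ["A100", "B200", "A100", "B200"]
labels = [1, 0, 1, 0]

# Text column: token counts followed by tf-idf weighting.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
X_text = tfidf.fit_transform(count_vect.fit_transform(complaints))

# Categorical column: strings -> integer codes -> one-hot columns.
label_enc = LabelEncoder()
onehot = OneHotEncoder()
part_codes = label_enc.fit_transform(part_numbers).reshape(-1, 1)
X_part = onehot.fit_transform(part_codes)

# Concatenate the two sparse blocks column-wise and fit the classifier.
X_train = hstack([X_text, X_part]).tocsr()
clf = MultinomialNB().fit(X_train, labels)

print(X_text.shape, X_part.shape, X_train.shape)
```

The number of columns in X_train is the vocabulary size plus the number of
distinct part numbers, and the fitted model expects exactly that many
columns at prediction time.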

The problem I run into is that when I call predict, I get a dimension
mismatch error.
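
One common cause of this kind of error is calling fit_transform on the test
data, which builds a new vocabulary and new categories, instead of reusing
the transformers fitted on the training data via transform. Whether that is
what happens in this notebook is an assumption, since its code is not
reproduced in this message. The sketch below uses hypothetical toy data and
variable names to show the difference in column counts.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data, not the data from the notebook.
train_text = [
    "engine makes noise",
    "brake pedal soft",
    "engine stalls",
    "brake squeal noise",
]
train_part = ["A100", "B200", "A100", "B200"]
y_train = [1, 0, 1, 0]
test_text = ["engine noise at startup"]
test_part = ["A100"]

count_vect = CountVectorizer()
tfidf = TfidfTransformer()
label_enc = LabelEncoder()
onehot = OneHotEncoder()

# Fit every transformer on the training data only.
X_train = hstack([
    tfidf.fit_transform(count_vect.fit_transform(train_text)),
    onehot.fit_transform(label_enc.fit_transform(train_part).reshape(-1, 1)),
]).tocsr()
clf = MultinomialNB().fit(X_train, y_train)

# At prediction time, call transform (not fit_transform) on the same
# fitted objects so the test matrix has the same columns as X_train.
X_test = hstack([
    tfidf.transform(count_vect.transform(test_text)),
    onehot.transform(label_enc.transform(test_part).reshape(-1, 1)),
]).tocsr()
pred = clf.predict(X_test)

# For contrast: refitting a vectorizer on the test text alone yields a
# different vocabulary, hence a different number of columns.
X_refit = CountVectorizer().fit_transform(test_text)

print(X_train.shape[1], X_test.shape[1], X_refit.shape[1])
```

Note that LabelEncoder.transform raises an error for a part number that was
not seen during training, so unseen categories would need separate handling.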

Here's my jupyter notebook gist:
http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85ef86ba41424b311

I would greatly appreciate it if someone could point out where I went wrong.
Thanks!

- Daniel