[scikit-learn] Help With Text Classification

Joel Nothman joel.nothman at gmail.com
Thu Aug 3 18:29:10 EDT 2017


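A Pipeline helps at prediction time too: the transformers it fitted on the
training data are reused, unchanged, when you call predict, so the encoded
columns keep the same width.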
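
As a rough sketch of that point, and not the code from the notebook: the toy
rows, the column order (complaint text first, part # second) and the step
names below are all made up for illustration. It also assumes a scikit-learn
release that ships ColumnTransformer (0.20 or later), which is newer than this
thread, and it uses TfidfVectorizer as shorthand for CountVectorizer followed
by TfidfTransformer.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is the complaint text, column 1 is the part #.
X_train = np.array(
    [
        ["engine makes noise", "A100"],
        ["brake pedal soft", "B200"],
        ["engine stalls", "A100"],
        ["brake squeal noise", "C300"],
    ],
    dtype=object,
)
y_train = [1, 0, 1, 0]

features = ColumnTransformer(
    [
        # A scalar column index hands the vectorizer a 1-D array of strings.
        ("text", TfidfVectorizer(), 0),
        # A list of indices hands the encoder the 2-D input it expects.
        ("part", OneHotEncoder(handle_unknown="ignore"), [1]),
    ]
)

model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(X_train, y_train)

# A single new row goes through the already-fitted transformers, so it is
# expanded to the same number of columns as the training matrix.
X_new = np.array([["engine noise", "B200"]], dtype=object)
n_train_cols = model.named_steps["features"].transform(X_train).shape[1]
n_new_cols = model.named_steps["features"].transform(X_new).shape[1]
prediction = model.predict(X_new)
```

Because fit and predict both go through the same Pipeline object, there is no
separate encoding step for the test row that could end up with a different
number of columns.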

On 4 Aug 2017 7:49 am, "pybokeh" <pybokeh at gmail.com> wrote:

> I found my problem.  When I one-hot encoded the part # for my test row, the
> result was a 1x1 matrix when I needed a 1x153 one.  This happened because I
> used the default setting ('auto') for n_values when I needed to set it to
> 153.  Now when I horizontally stack it onto my other feature matrix, the
> total number of columns correctly comes to 1294 instead of
> 1142.  Looking back, I am not sure whether using Pipeline or FeatureUnion
> would have helped or prevented this, since the error occurred
> on the prediction side, not on the training or modeling side.
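
The width mismatch described above can be reproduced on a small scale. This is
only a sketch with three made-up part numbers instead of 153, and it uses the
current OneHotEncoder API (string categories, handle_unknown) rather than the
n_values parameter mentioned above, so it illustrates the cause and is not the
original notebook code.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical part numbers: three distinct values in the training data.
train_parts = np.array([["A100"], ["B200"], ["C300"], ["A100"]], dtype=object)
test_part = np.array([["B200"]], dtype=object)

# Fitting a fresh encoder on the single test row: it only ever sees one
# category, so the encoded row has a single column.
wrong = OneHotEncoder().fit_transform(test_part)

# Reusing the encoder that was fitted on the training data: the test row is
# expanded to one column per training category.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_parts)
right = encoder.transform(test_part)

wrong_shape = wrong.shape
right_shape = right.shape
```

Here wrong_shape has one column while right_shape has three, which is the same
kind of gap as 1x1 versus 1x153.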
>
> On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> Use a Pipeline to help avoid this kind of issue (and others). You might
>> also want to do something like
>> http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
>>
>> On 3 August 2017 at 12:01, pybokeh <pybokeh at gmail.com> wrote:
>>
>>> Hello,
>>> I am studying this example from scikit-learn's site:
>>> http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
>>>
>>> The problem that I need to solve is very similar to this example, except
>>> that I have one additional feature column (part #), which is categorical
>>> and of type string.  My label or target values consist of just 2 values:
>>> 0 or 1.
>>>
>>> I am transforming that additional feature column with a LabelEncoder and
>>> then encoding it with the OneHotEncoder.
>>>
>>> Then I am concatenating that one-hot encoded column (part #) to the
>>> text/document feature column (complaint), to which I had applied the
>>> CountVectorizer and TfidfTransformer transformations.
>>>
>>> Then I chose the MultinomialNB model to fit my concatenated training
>>> data with.
>>>
>>> The problem I run into is that when I invoke the prediction, I get a
>>> dimension mismatch error.
>>>
>>> Here's my jupyter notebook gist:
>>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85ef86ba41424b311
>>>
>>> I would greatly appreciate it if someone could show me where I went
>>> wrong.  Thanks!
>>>
>>> - Daniel
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>