[scikit-learn] Loading file in libsvm format

Thu Sep 8 14:45:22 EDT 2016

Oh, I just figured it out: the number of features is the max value of term_id.
Sorry to disturb you ;)
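
For anyone who hits the same thing: as far as I can tell, the loader sizes X by the largest index found in the file, not by the number of distinct indices, so sparse term_ids in the millions give millions of columns. Below is a rough sketch (plain Python, no sklearn needed) of remapping the term_ids to contiguous indices before loading. The helper name remap_libsvm_indices is made up for illustration, and the sample lines are the three shown in my original message.

```python
def remap_libsvm_indices(lines):
    """Rewrite libsvm lines so feature indices are contiguous, starting at 1.

    Hypothetical helper: returns the rewritten lines and the mapping
    from original term_id to the new compact index.
    """
    index_map = {}  # original term_id -> compact 1-based index
    remapped = []
    for line in lines:
        label, *pairs = line.split()
        new_pairs = []
        for pair in pairs:
            term_id, value = pair.split(":")
            # Assign the next free index the first time a term_id is seen.
            new_index = index_map.setdefault(int(term_id), len(index_map) + 1)
            new_pairs.append((new_index, value))
        # Keep indices ascending within each line, as the format expects.
        new_pairs.sort()
        remapped.append(" ".join([label] + [f"{i}:{v}" for i, v in new_pairs]))
    return remapped, index_map


sample = [
    "6284 576:1 884:1 2482:1 4279:1 5765:1 184552:1 661512:1 699842:1",
    "2259 1669:1 5711528:6",
    "2822 5765159:1",
]
remapped, index_map = remap_libsvm_indices(sample)

largest_original_index = max(index_map)        # what drove the column count
number_of_unique_terms = len(index_map)        # what I expected instead
```

Writing the remapped lines back to a file and loading that with load_svmlight_file should then give roughly one column per unique term. Since X is a sparse matrix, the extra empty columns may not cost much anyway, so this is only worth doing if the column count itself matters.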

Cheers

On Thu, Sep 8, 2016 at 8:40 PM, klo uo <klonuo at gmail.com> wrote:

>
> ---------- Forwarded message ----------
> From: klo uo <klonuo at gmail.com>
> Date: Thu, Sep 8, 2016 at 8:25 PM
> Subject: Loading file in libsvm format
> To: scikit-learn-general at lists.sourceforge.net
>
>
> Hi,
>
> I produced a file in libsvm format:
>
>     <label> <index1>:<value1> <index2>:<value2> ...
>
> with this content:
>
>     6284 576:1 884:1 2482:1 4279:1 5765:1 184552:1 661512:1 699842:1
>     2259 1669:1 5711528:6
>     2822 5765159:1
>     ...
>
> The label is document_id, and index:value are term_id and term count.
>
> This file has 83K labels with 40K unique terms (and overall 1.2M
> index:value pairs).
>
> When I load this file in sklearn:
>
>     from sklearn.datasets import load_svmlight_file
>     X, y = load_svmlight_file('libsim.txt')
>
> I get X with shape (82448, 6092168).
>
> I don't see any reason why I am getting 6M features.
> Can someone explain?
>
>
> Thanks
>
>
>
>