[scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis?

Alex Garel alex at garel.org
Thu Apr 20 05:58:07 EDT 2017


I'm not totally sure of what you're trying to do, but here are some
remarks that may help you:

1. In modelfit = model.fit(count_vect, enc), the enc parameter is not
used: NearestNeighbors is unsupervised, so only the count_vect matrix
matters.
2. kneighbors returns row indices into count_vect, i.e. positions of
rows of wiki['text'], not encoded values of wiki['name'], so it is
very strange to use mod_enc.inverse_transform on them!
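
To illustrate both remarks, here is a minimal sketch on a made-up
three-document corpus (the texts and the labels array are assumptions for
the demo, not your data). It fits NearestNeighbors once with a second
argument and once without, and compares what kneighbors returns:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus, invented for this illustration only.
texts = ["apple banana apple", "banana cherry", "apple banana banana"]
labels = np.array([2, 0, 1])  # arbitrary; stands in for the encoded names

X = CountVectorizer().fit_transform(texts)

with_y = NearestNeighbors(n_neighbors=2, algorithm="brute").fit(X, labels)
without_y = NearestNeighbors(n_neighbors=2, algorithm="brute").fit(X)

idx_with_y = with_y.kneighbors(X[0], return_distance=False)
idx_without_y = without_y.kneighbors(X[0], return_distance=False)

# Same result either way: the second argument plays no role in the search.
print(np.array_equal(idx_with_y, idx_without_y))
# Each entry is a row number of X (here between 0 and 2), not a label.
print(idx_with_y)
```

The printed indices are positions in the fitted matrix, so they can only
be mapped back to names through the row order of the original data.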

Instead, you should probably take the rows that kneighbors points to in
count_vect and read "name" at the corresponding rows of your dataframe.
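
A sketch of that lookup, assuming a dataframe with "name" and "text"
columns like yours (the four rows below are invented stand-ins for the
wiki file, and query_row, neighbor_rows and neighbor_names are names I
made up). The query row is found by matching the name in the dataframe
itself, because the LabelEncoder code of a name is its rank among the
sorted names, not its row position:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented stand-in for the wiki dataframe (assumption for this sketch).
wiki = pd.DataFrame({
    "name": ["Zed", "Amy", "Bob", "Cat"],
    "text": [
        "science fiction editor and critic",
        "football player and coach",
        "science fiction writer and editor",
        "football coach",
    ],
})

count_vect = CountVectorizer().fit_transform(wiki["text"])
model = NearestNeighbors(n_neighbors=2, algorithm="brute", p=2).fit(count_vect)

# Positional row of the query, looked up by name in the dataframe itself.
query_row = int(np.flatnonzero((wiki["name"] == "Zed").to_numpy())[0])

# kneighbors returns row positions of the fitted matrix, shape (1, n_neighbors).
neighbor_rows = model.kneighbors(count_vect[query_row], return_distance=False)[0]

# Read "name" at those rows.
neighbor_names = wiki["name"].iloc[neighbor_rows].tolist()
print(neighbor_names)
```

In this toy data "Zed" sits at row 0 but would be encoded as 3 by a
LabelEncoder, so indexing count_vect with the encoded value would query
the wrong document.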

Hope it helps,

Alex

On 16/04/2017 at 10:56, Evaristo Caraballo via scikit-learn wrote:
> I have been asked to implement a simple knn for text similarity
> analysis. I tried using the sklearn.neighbors module.
> The file to be analysed consists of 2 relevant columns: "text" and
> "name".
> The knn model should be fitted with a bag-of-words representation of a
> corpus of around 60,000 pre-treated text fragments of about 200 words
> each. I used CountVectorizer.
> As a test, I was asked to use the model to get the names in the "name"
> column for the 10 texts that are closest to a pre-selected one, which
> also exists in the corpus used to fit the knn model. Similarity should
> be measured using a Euclidean metric.
> I used the kneighbors function to obtain the closest neighbors.
> Below you can find the code I was trying to implement using kneighbors:
>
>     import os, sys
>     import sklearn
>     import sklearn.neighbors as sk_neighbors
>     from sklearn.feature_extraction.text import CountVectorizer
>     import pandas
>     import scipy
>     import matplotlib.pyplot as plt
>     import numpy as np
>     %matplotlib inline
>
>     wiki = pandas.read_csv('wiki_filefragment.csv')
>     mod_count_vect = CountVectorizer()
>     count_vect = mod_count_vect.fit_transform(wiki['text'])
>     print(count_vect.shape)
>     mod_count_vect.get_feature_names()
>     mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
>     enc = mod_enc.transform(wiki['name'])
>     enc
>     model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)  # no matter what I use, it is always the same
>     modelfit = model.fit(count_vect, enc)
>     # also likely the kneighbors is not working?
>     print(mod_enc.inverse_transform(modelfit.kneighbors(count_vect[mod_enc.transform(['Franz Rottensteiner'])], n_neighbors=11, return_distance=False)))
>
> This implementation gave me the following results for the first 10
> nearest neighbors to 'Franz Rottensteiner':
>
>     Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III,
>     Tofusquirrel , M. G. Sheftall, Peter Maurer, Allan Weisbecker,
>     Ferdinand Knobloch, Andrea Foulkes, Alan W. Meerow, John Warner
>     (writer)
>
> These results are far from the test solution (which uses GraphLab
> Create and SFrame), which is:
>
>     Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha,
>     Andr%C3%A9 Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W.
>     Meerow, John Angus Campbell, Antonello Bonci, Henkjan Honing,
>     Joseph Born Kadane
>
> In fact, I tried a simple brute-force implementation that iterates over
> the list of texts and computes distances with scipy, and that gave me
> the expected results. The result was the same when I also used Python 2.7.
> A link to the implementations (the one that doesn't work and the one
> that does), together with a sample of the file used for this test, can
> be found in this Gist
> <https://gist.github.com/evaristoc/eb2f2d91524b874c4db6638359e32b0f>.
> Can anyone suggest what is wrong with my sklearn implementation?
> Relevant resources are:
> - Anaconda Python 3.5 (with a virtualenv using 2.7)
> - Jupyter
> - sklearn 0.18
> - pandas
>
>
