[scikit-learn] MultiLabelBinarizer gives individual characters instead of the classes
Sayak Paul
spsayakpaul at gmail.com
Thu Sep 12 00:31:36 EDT 2019
Hi.
I am working on a multi-label text classification problem. To encode
the labels, I am using MultiLabelBinarizer. The labels of the dataset
look like -
[image: image]
<https://user-images.githubusercontent.com/22957388/64753547-42b10a00-d541-11e9-80b2-f0a9245df327.png>
When I am using
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(labels)
print(mlb.classes_)
I am getting -
[image: image]
<https://user-images.githubusercontent.com/22957388/64753625-78ee8980-d541-11e9-8833-a17769f1bf47.png>
Whereas, the output (sample output) I want is -
[image: image]
<https://user-images.githubusercontent.com/22957388/64753641-89066900-d541-11e9-98fb-fb9f9e1e7305.png>
I got the above output by -
mlb = MultiLabelBinarizer()
sample_labels = [
['stat.ML', 'cs.LG'],
['cs.CV', 'cs.RO']
]
mlb.fit(sample_labels)
print(mlb.classes_)
Help would be very much appreciated here.
Here's the dataset I had prepared:
arXivdata.csv.zip
<https://github.com/scikit-learn/scikit-learn/files/3603687/arXivdata.csv.zip>
I stripped away the double quotes in the labels after loading it into a
pandas DataFrame with -
arxiv_data['labels'] = arxiv_data['labels'].str.replace('"', '', regex=False)
scikit-learn version: '0.21.3'
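A minimal sketch of one possible cause, assuming each entry of the labels column is a string such as "['stat.ML', 'cs.LG']" rather than a Python list (as can happen when lists are written to and read back from a CSV). MultiLabelBinarizer iterates over every sample, so a string sample would contribute its individual characters as classes. The names string_labels and parsed_labels are hypothetical stand-ins, not the actual dataset:

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the labels column: each entry is a *string*
# that merely looks like a list (assumption, not the real data).
string_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]

# Fitting directly on strings: each string is iterated character by
# character, so the learned classes are single characters.
mlb_chars = MultiLabelBinarizer()
mlb_chars.fit(string_labels)
char_classes = list(mlb_chars.classes_)
print(char_classes)

# Parsing each string into a real list first yields the intended classes.
parsed_labels = [ast.literal_eval(s) for s in string_labels]
mlb = MultiLabelBinarizer()
mlb.fit(parsed_labels)
classes = list(mlb.classes_)
print(classes)
```

If this assumption holds for the DataFrame, the same parsing could presumably be applied with arxiv_data['labels'].apply(ast.literal_eval) before fitting, provided every cell is a valid Python literal.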
Sayak Paul | sayak.dev