[scikit-learn] MultiLabelBinarizer gives individual characters instead of the classes

Thu Sep 12 00:31:36 EDT 2019

Hi.

I am working on a multi-label text classification problem. To encode
the labels, I am using MultiLabelBinarizer. The labels in the dataset
look like this -

[image: image]
<https://user-images.githubusercontent.com/22957388/64753547-42b10a00-d541-11e9-80b2-f0a9245df327.png>

When I am using

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(labels)
print(mlb.classes_)

I am getting -

[image: image]
<https://user-images.githubusercontent.com/22957388/64753625-78ee8980-d541-11e9-8833-a17769f1bf47.png>

Whereas, the output (sample output) I want is -

[image: image]
<https://user-images.githubusercontent.com/22957388/64753641-89066900-d541-11e9-98fb-fb9f9e1e7305.png>

I got the above output by -

mlb = MultiLabelBinarizer()
sample_labels = [
    ['stat.ML', 'cs.LG'],
    ['cs.CV', 'cs.RO']
]
mlb.fit(sample_labels)
print(mlb.classes_)

Help would be very much appreciated here.

Here's the dataset I had prepared:
arXivdata.csv.zip
<https://github.com/scikit-learn/scikit-learn/files/3603687/arXivdata.csv.zip>

I stripped the double quotes from the labels after loading the data
into a pandas DataFrame by -

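# Remove every double-quote character; regex=True is spelled out because
# the default for this parameter differs between pandas versions.
arxiv_data['labels'] = arxiv_data['labels'].str.replace(r"[\"]", '', regex=True)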
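
One possible explanation for the single-character classes, sketched under
an assumption that has not been checked against the attached CSV: after
the quotes are stripped, each entry in the labels column may still be a
string such as "['stat.ML', 'cs.LG']" rather than a Python list.
MultiLabelBinarizer iterates over every sample, and iterating over a
string yields its characters. The raw_labels list below is a hypothetical
stand-in for that column, and ast.literal_eval is one way (not
necessarily the best for this data) to turn such strings into lists.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the 'labels' column: each entry is a string
# that merely looks like a list (assumption, not taken from the CSV).
raw_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]

# Fitting on strings: each string is iterated character by character,
# so every distinct character becomes its own class.
char_classes = MultiLabelBinarizer().fit(raw_labels).classes_
print(char_classes)

# Parse each string into a real list of labels before fitting.
parsed_labels = [ast.literal_eval(item) for item in raw_labels]

mlb = MultiLabelBinarizer()
mlb.fit(parsed_labels)
print(mlb.classes_)
```

If the column really has this shape, the same parsing step could be
applied to the DataFrame with arxiv_data['labels'].apply(ast.literal_eval)
before calling mlb.fit.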

scikit-learn version: '0.21.3'

Sayak Paul | sayak.dev