[scikit-learn] MultiLabelBinarizer gives individual characters instead of the classes

Sayak Paul spsayakpaul at gmail.com
Thu Sep 12 00:31:36 EDT 2019


I am working on a multi-label text classification problem. To encode
the labels, I am using MultiLabelBinarizer. The labels in the dataset
look like this:

[image: image]

When I fit

mlb = MultiLabelBinarizer()

on these labels, I get individual characters as the classes:

[image: image]

The output I want (shown here for a small sample) is:

[image: image]

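I got the above output with:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
sample_labels = [
    ['stat.ML', 'cs.LG'],
    ['cs.CV', 'cs.RO']
]
# assumed next step (the snippet was cut off): fit on the sample
mlb.fit(sample_labels)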

Help would be very much appreciated here.
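
For reference, here is a minimal sketch that seems to reproduce the
symptom. It assumes (this is a guess, not something confirmed from the
dataset) that each row's labels are stored as one string rather than as
a list of strings. The names string_labels and list_labels are made up
for this illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows: each label set is ONE string, not a list.
string_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]
# The same rows as real Python lists of label strings.
list_labels = [['stat.ML', 'cs.LG'], ['cs.CV', 'cs.RO']]

mlb_strings = MultiLabelBinarizer().fit(string_labels)
mlb_lists = MultiLabelBinarizer().fit(list_labels)

# A string is iterated character by character, so every class
# learned here is a single character.
print(list(mlb_strings.classes_))

# Lists of strings give one class per label:
# ['cs.CV', 'cs.LG', 'cs.RO', 'stat.ML']
print(list(mlb_lists.classes_))
```

If the first print shows brackets, quotes, dots and letters as separate
classes, the input is probably being treated as strings rather than as
lists.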

Here's the dataset I had prepared:

After loading it into a pandas DataFrame, I stripped the double quotes
from the labels with:

arxiv_data['labels'] = arxiv_data['labels'].str.replace('"', '', regex=False)
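
One thing that may be worth checking: removing quote characters still
leaves each cell as a single string. The sketch below assumes a cell
looks like the text of a Python list (the value of raw is hypothetical,
not taken from the dataset) and uses the standard library's
ast.literal_eval to turn it into a real list.

```python
import ast

# Hypothetical raw cell, as it might look after reading the file.
raw = "['cs.CV', 'cs.LG']"

# Removing double quotes does not change the type: still a str.
stripped = raw.replace('"', '')
print(type(stripped).__name__)

# ast.literal_eval parses the literal into an actual list of strings.
parsed = ast.literal_eval(stripped)
print(parsed)
```

If that assumption holds, something along the lines of
arxiv_data['labels'].apply(ast.literal_eval) might give
MultiLabelBinarizer the lists of labels it expects.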

scikit-learn version: '0.21.3'

Sayak Paul | sayak.dev