[scikit-learn] Supervised prediction of multiple scores for a document

Amirouche Boubekki amirouche.boubekki at gmail.com
Sun Jun 3 17:03:08 EDT 2018


Hello,

I started a natural language processing project a few weeks ago called
wikimark <https://github.com/amirouche/wikimark/> (the code is all in
wikimark.py
<https://github.com/amirouche/wikimark/blob/master/wikimark.py#L1>)

Given a text, it aims to return a dictionary scoring the input against the
vital articles categories
<https://en.wikipedia.org/api/rest_v1/page/html/Wikipedia%3AVital_articles%2FLevel%2F5>,
e.g.:

out = wikimark("""Peter Hintjens wrote about the relation between
technology and culture. Without using the scientific tone of a
state-of-the-art review of Anthropocene anthropology, he gives a fair
amount of food for thought. According to Hintjens, technology is doomed to
become cheap. As a matter of fact, intelligence tools will become more and
more accessible, which will trigger a revolution to rebalance forces in
society.""")

for category, score in out:
    print('{} ~ {}'.format(category, score))

The above program would output something like this:

Art ~ 0.1
Science ~ 0.5
Society ~ 0.4

Except not everything went as planned. Note that in the above example the
scores sum to 1, but I could not achieve that at all.

I am using gensim to compute vectors of paragraphs (doc2vec) and then
submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document
is scored 1 if it is in that subcategory and 0 otherwise. At prediction
time, the input goes through the same doc2vec pipeline. The program scores
*each paragraph* against the SVR models of the Wikipedia vital article
subcategories and gets a value between 0 and 1 for *each paragraph*. I sum
the values and group by subcategory, and then I have a score per category
for the input document.
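The scoring stage described above could be sketched roughly as follows. This is only a sketch under assumptions: the paragraph vectors would normally come from gensim's Doc2Vec, but random vectors stand in here so the example is self-contained, and the three subcategories and all data are made up.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical training data: 60 paragraph vectors of dimension 16, each
# labelled with one of three subcategories (0, 1, 2). In the real
# pipeline these vectors come out of gensim's doc2vec.
X_train = rng.normal(size=(60, 16))
y_subcategory = rng.integers(0, 3, size=60)

# One SVR per subcategory, trained one-vs-rest: the regression target is
# 1.0 when the paragraph belongs to that subcategory, 0.0 otherwise.
models = {}
for sub in range(3):
    target = (y_subcategory == sub).astype(float)
    models[sub] = SVR().fit(X_train, target)

# At prediction time, score each paragraph of a new document against
# every model, then sum per subcategory, as described above.
paragraphs = rng.normal(size=(4, 16))  # 4 paragraph vectors (stand-ins)
scores = {sub: float(model.predict(paragraphs).sum())
          for sub, model in models.items()}
print(scores)
```

Nothing in SVR constrains these per-subcategory sums to add up to 1, which is consistent with the problem described above.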

It somewhat works. I put a web UI online at https://sensimark.com where
you can test it. You can also access the full API directly, e.g.
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1

The output JSON document is a list of category dictionaries where the
prediction key is associated with the average of the "prediction" of the
subcategories. If you replace &all=1 with &top=10 or &top=5 you only get
the top categories, e.g.

https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
or
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5

I wrote "prediction" with double quotes because the value you see is the
result of a formula. Since the predictions I get are rather small, between
0 and 0.015, I apply the following formula:

value = math.exp(prediction)
magic = ((value * 100) - 110) * 100

in order to spread the values between -200 and 200. Maybe this is a
symptom that my model doesn't work at all.
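If the goal is for the per-category scores to sum to 1, as in the example at the top, a normalization would achieve that directly instead of an ad-hoc rescaling. A sketch, with made-up raw scores in the 0-0.015 range mentioned above:

```python
import math

# Hypothetical raw per-category scores (made-up values in the small
# 0-0.015 range described above).
raw = {'Art': 0.002, 'Science': 0.010, 'Society': 0.008}

# Simple normalization: divide each score by the total, so the
# resulting values sum to 1.
total = sum(raw.values())
normalized = {k: v / total for k, v in raw.items()}

# Alternatively a softmax, which also sums to 1; dividing the raw
# scores by a temperature < 1 first would sharpen the differences.
exp = {k: math.exp(v) for k, v in raw.items()}
z = sum(exp.values())
softmax = {k: v / z for k, v in exp.items()}

print(normalized)
print(softmax)
```

With scores this small, the plain softmax output is nearly uniform, so the simple normalization (or a temperature-scaled softmax) spreads the categories more usefully.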

Still, the top 10 results are almost always near each other (try with BBC
<http://www.bbc.com/> articles on https://sensimark.com). It is only when
a regression model is disqualified with a score of 0 that the results are
easy to interpret. Sadly, I don't have an example at hand to support that
claim; you will have to take my word for it.

Looking at the machine learning map
<http://scikit-learn.org/stable/tutorial/machine_learning_map/>, I just
figured that my problem might be a classification problem, except I don't
really want to know what *the* class of a new document is; I want to know
which different subjects are dealt with in the document, based on a
hierarchical corpus. I don't want to guess a hierarchy! I want to know how
the document content spreads over the different categories or
subcategories.

I quickly read about multinomial regression; is it something you recommend
I use? Or maybe you have something else in mind?

Also, it seems I should benchmark / evaluate my model against LDA.

I am rather a noob in terms of data science and my math skills are not so
fresh. I am mostly looking for ideas about which algorithms, fine-tuning,
and data science practices I should follow that do not involve writing my
own algorithm.

Thanks in advance!