[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

Markus Konrad markus.konrad at wzb.eu
Tue Sep 19 04:26:41 EDT 2017


This is indeed interesting. I didn't know there were such big
differences between these approaches. I split the 18 documents into
sub-documents of 5 paragraphs each, so that I got around 10k of these
sub-documents. Now, scikit-learn and gensim deliver much better results,
quite similar to those from a Gibbs-sampling-based implementation. So it
was essentially the same data, just split in a different way.
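For reference, such a split needs only plain string operations. A minimal sketch (the blank-line paragraph delimiter and the helper name are assumptions for illustration; the raw NLTK Gutenberg files do separate paragraphs with blank lines):

```python
def split_into_subdocs(raw_text, paras_per_doc=5):
    """Split a raw text into sub-documents of N paragraphs each.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    return ["\n\n".join(paragraphs[i:i + paras_per_doc])
            for i in range(0, len(paragraphs), paras_per_doc)]

# Example: 7 paragraphs -> two sub-documents (5 + 2 paragraphs)
text = "\n\n".join("paragraph %d" % i for i in range(7))
subdocs = split_into_subdocs(text)
print(len(subdocs))  # -> 2
```

The resulting list of strings can be fed to `CountVectorizer` in place of the original whole-book documents.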

I think the disadvantages/limits of the Variational Bayes approach
should be mentioned in the documentation.

Best,
Markus



On 09/18/2017 06:59 PM, Andreas Mueller wrote:
> For very few documents, Gibbs sampling is likely to work better - or
> rather, Gibbs sampling usually works better given enough runtime, and
> for so few documents, runtime is not an issue.
> The length of the documents doesn't matter, only the size of the
> vocabulary. Also, hyperparameter choices might need to be different
> for Gibbs sampling vs. variational inference.
> 
> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>> Hi Chyi-Kwei,
>>
>> thanks for digging into this. I made similar observations with Gensim
>> when using only a small number of (big) documents. Gensim also uses the
>> Online Variational Bayes approach (Hoffman et al.). So could it be that
>> the Hoffman et al. method is problematic in such scenarios? I found that
>> Gibbs sampling based implementations provide much more informative
>> topics in this case.
>>
>> If this was the case, then if I'd slice the documents in some way (say
>> every N paragraphs become a "document") then I should get better results
>> with scikit-learn and Gensim, right? I think I'll try this out tomorrow.
>>
>> Best,
>> Markus
>>
>>
>>
>>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>>> From: chyi-kwei yau <chyikwei.yau at gmail.com>
>>> To: Scikit-learn mailing list <scikit-learn at python.org>
>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
>>>     topics in NLTK Gutenberg corpus?
>>> Message-ID:
>>>     <CAK-jh0Ygd8fSdJom+gdDOHvAYCPuJVHHX77qcd+d4_xm6vi9yA at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hi Markus,
>>>
>>> I tried your code, and I think the issue is that there are only 18
>>> docs in the Gutenberg corpus.
>>> If you print out the transformed doc-topic distributions, you will
>>> see that a lot of topics are not used. And since no words are
>>> assigned to those topics, their weights will be equal to the
>>> `topic_word_prior` parameter.
>>>
>>> You can print out the transformed doc topic distributions like this:
>>> -------------
>>>>>> doc_distr = lda.fit_transform(tf)
>>>>>> for d in doc_distr:
>>> ...     print(np.where(d > 0.001)[0])
>>> ...
>>> [17 27]
>>> [17 27]
>>> [17 27 28]
>>> [14]
>>> [ 2  4 28]
>>> [ 2  4 15 21 27 28]
>>> [1]
>>> [ 1  2 17 21 27 28]
>>> [ 2 15 17 22 28]
>>> [ 2 17 21 22 27 28]
>>> [ 2 15 17 28]
>>> [ 2 17 21 27 28]
>>> [ 2 14 15 17 21 22 27 28]
>>> [15 22]
>>> [ 8 11]
>>> [8]
>>> [ 8 24]
>>> [ 2 14 15 22]
>>>
>>> and my full test scripts are here:
>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>>
>>> Best,
>>> Chyi-Kwei
>>>
>>>
>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.konrad at wzb.eu>
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm trying out sklearn's latent Dirichlet allocation implementation for
>>>> topic modeling. The code from the official example [1] works just
>>>> fine and
>>>> the extracted topics look reasonable. However, when I try other
>>>> corpora,
>>>> for example the Gutenberg corpus from NLTK, most of the extracted
>>>> topics
>>>> are garbage. See this example output, when trying to get 30 topics:
>>>>
>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>> ...
>>>>
>>>> Many topics tend to have the same weights, all equal to the
>>>> `topic_word_prior` parameter.
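Degenerate topics of this kind can be detected directly from a fitted model: a topic that never receives any words keeps its (unnormalized) weights in `components_` close to `topic_word_prior`. A self-contained sketch, using a tiny made-up corpus rather than the Gutenberg data, with an illustrative threshold:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus: far fewer documents than requested topics.
docs = ["apple banana apple fruit",
        "banana fruit apple banana",
        "car engine wheel road",
        "engine car road wheel car"]

tf = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10,
                                learning_method='batch',
                                topic_word_prior=0.01,
                                max_iter=100,
                                random_state=1)
lda.fit(tf)

# Flag topics whose largest weight never moved far from the prior;
# the factor of 10 is an arbitrary illustrative cutoff.
unused = lda.components_.max(axis=1) < 10 * 0.01
print("unused topics:", np.where(unused)[0])
```

On a corpus with very few documents, many of the 10 requested topics typically end up in this flagged set, which is exactly the symptom seen in the output above.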
>>>>
>>>> This is my script:
>>>>
>>>> import nltk
>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>> from sklearn.decomposition import LatentDirichletAllocation
>>>>
>>>> def print_top_words(model, feature_names, n_top_words):
>>>>      for topic_idx, topic in enumerate(model.components_):
>>>>          message = "Topic #%d: " % topic_idx
>>>>          message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>>>>                               for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>>          print(message)
>>>>
>>>>
>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>>                 for f_id in nltk.corpus.gutenberg.fileids()]
>>>>
>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>>                                  stop_words='english')
>>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>>
>>>> lda = LatentDirichletAllocation(n_components=30,
>>>>                                  learning_method='batch',
>>>>                                  n_jobs=-1,  # all CPUs
>>>>                                  verbose=1,
>>>>                                  evaluate_every=10,
>>>>                                  max_iter=1000,
>>>>                                  doc_topic_prior=0.1,
>>>>                                  topic_word_prior=0.01,
>>>>                                  random_state=1)
>>>>
>>>> lda.fit(tf)
>>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>>> print_top_words(lda, tf_feature_names, 5)
>>>>
>>>> Is there a problem in how I set up the LatentDirichletAllocation
>>>> instance
>>>> or pass the data? I tried out different parameter settings, but none of
>>>> them provided good results for that corpus. I also tried out
>>>> alternative
>>>> implementations (like the lda package [2]) and those were able to find
>>>> reasonable topics.
>>>>
>>>> Best,
>>>> Markus
>>>>
>>>>
>>>> [1]
>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>>>
>>>> [2] http://pythonhosted.org/lda/
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 

