[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

Mon Sep 18 12:26:35 EDT 2017

Hi Chyi-Kwei,

thanks for digging into this. I made similar observations with Gensim
when using only a small number of (big) documents. Gensim also uses the
Online Variational Bayes approach (Hoffman et al.). So could it be that
the Hoffman et al. method is problematic in such scenarios? I found that
Gibbs sampling based implementations provide much more informative
topics in this case.

If that is the case, then slicing the documents in some way (say,
every N paragraphs become a "document") should give better results
with scikit-learn and Gensim, right? I think I'll try this out tomorrow.
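
A minimal sketch of the slicing idea, assuming paragraphs are separated
by blank lines. The helper name `split_into_chunks` and the default of 50
paragraphs per chunk are illustrative choices only, not part of
scikit-learn, Gensim or NLTK:

```python
import re


def split_into_chunks(text, paragraphs_per_chunk=50):
    """Split raw text into pseudo-documents of N paragraphs each.

    Hypothetical helper: a paragraph boundary is assumed to be one or
    more blank lines, which may not hold for every corpus file.
    """
    # Split on blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Group consecutive paragraphs; the last chunk may be shorter.
    return ["\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
            for i in range(0, len(paragraphs), paragraphs_per_chunk)]


# Tiny demo: five paragraphs, two per chunk.
sample = "p1\n\np2\n\np3\n\np4\n\np5"
chunks = split_into_chunks(sample, paragraphs_per_chunk=2)
print(len(chunks))
```

Each chunk could then be used in place of a whole book, for example by
looping over nltk.corpus.gutenberg.fileids() and extending data_samples
with split_into_chunks(nltk.corpus.gutenberg.raw(f_id)). Whether this
actually improves the topics is exactly what remains to be tested.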

Best,
Markus

> Date: Sun, 17 Sep 2017 23:52:51 +0000
> From: chyi-kwei yau <chyikwei.yau at gmail.com>
> To: Scikit-learn mailing list <scikit-learn at python.org>
> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
> 	topics in NLTK Gutenberg corpus?
> 
> Hi Markus,
> 
> I tried your code and found that the issue might be that there are only
> 18 documents in the Gutenberg corpus.
> If you print out the transformed doc-topic distribution, you will see
> that a lot of topics are not used.
> And since no words are assigned to those topics, their weights will be
> equal to the `topic_word_prior` parameter.
> 
> You can print out the transformed doc topic distributions like this:
> -------------
> >>> import numpy as np
> >>> doc_distr = lda.fit_transform(tf)
> >>>
> >>> for d in doc_distr:
> ...     print(np.where(d > 0.001)[0])
> ...
> [17 27]
> [17 27]
> [17 27 28]
> [14]
> [ 2  4 28]
> [ 2  4 15 21 27 28]
> [1]
> [ 1  2 17 21 27 28]
> [ 2 15 17 22 28]
> [ 2 17 21 22 27 28]
> [ 2 15 17 28]
> [ 2 17 21 27 28]
> [ 2 14 15 17 21 22 27 28]
> [15 22]
> [ 8 11]
> [8]
> [ 8 24]
> [ 2 14 15 22]
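
The same check can be made from the topic side. This is only a rough
sketch built on the observation above that unused topics keep weights
equal to `topic_word_prior`; the function name `unused_topics` and the
tolerance `tol` are assumptions for illustration:

```python
def unused_topics(components, topic_word_prior, tol=1e-3):
    """Return indices of topics whose word weights all sit at the prior.

    `components` is any 2-D array-like of shape (n_topics, n_words),
    for example the `components_` attribute of a fitted model.
    """
    return [k for k, row in enumerate(components)
            if all(abs(w - topic_word_prior) < tol for w in row)]


# Toy stand-in for lda.components_ (3 topics x 4 words), prior = 0.01.
toy = [[0.01, 0.01, 0.01, 0.01],
       [1081.61, 866.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, 0.01]]
empty = unused_topics(toy, topic_word_prior=0.01)
print(empty)
```

With the fitted model from the script, the call would presumably be
unused_topics(lda.components_, 0.01), using the same prior that was
passed to LatentDirichletAllocation.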
> 
> and my full test scripts are here:
> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
> 
> Best,
> Chyi-Kwei
> 
> 
> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.konrad at wzb.eu> wrote:
> 
>> Hi there,
>>
>> I'm trying out sklearn's latent Dirichlet allocation implementation for
>> topic modeling. The code from the official example [1] works just fine and
>> the extracted topics look reasonable. However, when I try other corpora,
>> for example the Gutenberg corpus from NLTK, most of the extracted topics
>> are garbage. See this example output, when trying to get 30 topics:
>>
>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>> ...
>>
>> Many topics tend to have the same weights, all equal to the
>> `topic_word_prior` parameter.
>>
>> This is my script:
>>
>> import nltk
>> from sklearn.feature_extraction.text import CountVectorizer
>> from sklearn.decomposition import LatentDirichletAllocation
>>
>> def print_top_words(model, feature_names, n_top_words):
>>     for topic_idx, topic in enumerate(model.components_):
>>         message = "Topic #%d: " % topic_idx
>>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>         print(message)
>>
>>
>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>                for f_id in nltk.corpus.gutenberg.fileids()]
>>
>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>                                 stop_words='english')
>> tf = tf_vectorizer.fit_transform(data_samples)
>>
>> lda = LatentDirichletAllocation(n_components=30,
>>                                 learning_method='batch',
>>                                 n_jobs=-1,  # all CPUs
>>                                 verbose=1,
>>                                 evaluate_every=10,
>>                                 max_iter=1000,
>>                                 doc_topic_prior=0.1,
>>                                 topic_word_prior=0.01,
>>                                 random_state=1)
>>
>> lda.fit(tf)
>> tf_feature_names = tf_vectorizer.get_feature_names()
>> print_top_words(lda, tf_feature_names, 5)
>>
>> Is there a problem in how I set up the LatentDirichletAllocation instance
>> or pass the data? I tried out different parameter settings, but none of
>> them provided good results for that corpus. I also tried out alternative
>> implementations (like the lda package [2]) and those were able to find
>> reasonable topics.
>>
>> Best,
>> Markus
>>
>>
>> [1]
>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>> [2] http://pythonhosted.org/lda/