[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

chyi-kwei yau chyikwei.yau at gmail.com
Sun Sep 17 19:52:51 EDT 2017


Hi Markus,

I tried your code, and the issue might be that there are only 18 docs
in the Gutenberg corpus.
If you print out the transformed doc-topic distributions, you will see
that a lot of topics are unused.
And since no words are assigned to those topics, their weights stay
equal to the `topic_word_prior` parameter.

You can print them out like this:
-------------
>>> import numpy as np
>>> doc_distr = lda.fit_transform(tf)
>>> for d in doc_distr:
...     print(np.where(d > 0.001)[0])
...
[17 27]
[17 27]
[17 27 28]
[14]
[ 2  4 28]
[ 2  4 15 21 27 28]
[1]
[ 1  2 17 21 27 28]
[ 2 15 17 22 28]
[ 2 17 21 22 27 28]
[ 2 15 17 28]
[ 2 17 21 27 28]
[ 2 14 15 17 21 22 27 28]
[15 22]
[ 8 11]
[8]
[ 8 24]
[ 2 14 15 22]
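
As a quick sanity check, you can also flag the empty topics directly from
`lda.components_`. Here is a rough sketch (the `used` flag and the 1e-6
tolerance are just my own illustration, not part of the gist below):
-------------
import numpy as np

# an unused topic keeps every weight at topic_word_prior, so its
# maximum weight never rises above the prior
topic_word_prior = 0.01  # same value as in your script
used = lda.components_.max(axis=1) > topic_word_prior + 1e-6
print("topics with words assigned:", np.where(used)[0])
print("empty topics:", np.where(~used)[0])
-------------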

My full test script is here:
https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
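
One possible workaround is to give the model more documents, e.g. by
splitting each book into fixed-size chunks. A minimal, untested sketch
(the `chunk_words` helper and the 200-word chunk size are my own
assumptions, not something I benchmarked):
-------------
def chunk_words(words, chunk_size=200):
    # join every chunk_size tokens into one pseudo-document
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])

data_samples = [chunk
                for f_id in nltk.corpus.gutenberg.fileids()
                for chunk in chunk_words(list(nltk.corpus.gutenberg.words(f_id)))]
-------------
With a few thousand pseudo-documents instead of 18, all 30 topics have a
much better chance of getting words assigned.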

Best,
Chyi-Kwei


On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.konrad at wzb.eu> wrote:

> Hi there,
>
> I'm trying out sklearn's latent Dirichlet allocation implementation for
> topic modeling. The code from the official example [1] works just fine and
> the extracted topics look reasonable. However, when I try other corpora,
> for example the Gutenberg corpus from NLTK, most of the extracted topics
> are garbage. See this example output, when trying to get 30 topics:
>
> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
> ...
>
> Many topics tend to have the same weights, all equal to the
> `topic_word_prior` parameter.
>
> This is my script:
>
> import nltk
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.decomposition import LatentDirichletAllocation
>
> def print_top_words(model, feature_names, n_top_words):
>     for topic_idx, topic in enumerate(model.components_):
>         message = "Topic #%d: " % topic_idx
>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>         print(message)
>
>
> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>                 for f_id in nltk.corpus.gutenberg.fileids()]
>
> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>                                 stop_words='english')
> tf = tf_vectorizer.fit_transform(data_samples)
>
> lda = LatentDirichletAllocation(n_components=30,
>                                 learning_method='batch',
>                                 n_jobs=-1,  # all CPUs
>                                 verbose=1,
>                                 evaluate_every=10,
>                                 max_iter=1000,
>                                 doc_topic_prior=0.1,
>                                 topic_word_prior=0.01,
>                                 random_state=1)
>
> lda.fit(tf)
> tf_feature_names = tf_vectorizer.get_feature_names()
> print_top_words(lda, tf_feature_names, 5)
>
> Is there a problem in how I set up the LatentDirichletAllocation instance
> or pass the data? I tried out different parameter settings, but none of
> them provided good results for that corpus. I also tried out alternative
> implementations (like the lda package [2]) and those were able to find
> reasonable topics.
>
> Best,
> Markus
>
>
> [1]
> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
> [2] http://pythonhosted.org/lda/