[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

Markus Konrad markus.konrad at wzb.eu
Thu Sep 14 10:10:52 EDT 2017


Hi there,

I'm trying out sklearn's latent Dirichlet allocation implementation for topic modeling. The code from the official example [1] works just fine and the extracted topics look reasonable. However, when I try other corpora, for example the Gutenberg corpus from NLTK, most of the extracted topics are garbage. See this example output when extracting 30 topics:

Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...

Many of the topics end up identical, with top-word weights all equal to the `topic_word_prior` parameter (0.01). Since `components_` holds the topic-word prior plus the expected word counts per topic, these topics apparently never received any word mass during fitting (see the quick check after the script).

This is my script:

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)


# one raw document per book in NLTK's Gutenberg corpus
data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

# raw term counts -- LDA works on counts, not tf-idf weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                n_jobs=-1,  # all CPUs
                                verbose=1,
                                evaluate_every=10,
                                max_iter=1000,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=1)

lda.fit(tf)
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)
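
As a quick sanity check on the symptom above, the number of topics that never move away from the prior can be counted directly. This is just a sketch against the fitted objects from the script (`lda`, `tf`): since `components_` is the topic-word prior plus the expected word counts, a row that is still numerically at `topic_word_prior` never received any counts.

import numpy as np

# How much data is there, actually? The Gutenberg corpus has few
# (but long) documents.
print("document-term matrix shape:", tf.shape)

# A topic is "empty" if every entry of its row in components_ is still
# (up to a small tolerance) equal to the topic_word_prior, i.e. no word
# was ever assigned to it during fitting.
empty = np.isclose(lda.components_, lda.topic_word_prior_,
                   atol=1e-3).all(axis=1)
print("empty topics: %d of %d" % (empty.sum(), empty.size))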

Is there a problem in how I set up the LatentDirichletAllocation instance or in how I pass the data? I tried different parameter settings, but none of them produced good results for this corpus. Alternative implementations (like the lda package [2]) were, by contrast, able to find reasonable topics on the same data; see the comparison sketch below.
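
For reference, this is roughly how I ran the lda package on the same document-term matrix (a sketch; the parameter values are just what I tried). It uses collapsed Gibbs sampling instead of variational inference and accepts the same sparse count matrix:

import lda

# collapsed Gibbs sampling on the same term counts as above
gibbs_model = lda.LDA(n_topics=30, n_iter=1500, random_state=1)
gibbs_model.fit(tf)

# topic_word_ holds the normalized topic-word distributions
for topic_idx, topic in enumerate(gibbs_model.topic_word_):
    top = topic.argsort()[:-6:-1]  # indices of the 5 largest weights
    print("Topic #%d: %s" % (topic_idx,
                             " ".join(tf_feature_names[i] for i in top)))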

Best,
Markus


[1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/
