[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?
Markus Konrad
markus.konrad at wzb.eu
Thu Sep 14 10:10:52 EDT 2017
Hi there,
I'm trying out sklearn's latent Dirichlet allocation implementation for topic modeling. The code from the official example [1] works just fine and the extracted topics look reasonable. However, when I try other corpora, for example the Gutenberg corpus from NLTK, most of the extracted topics are garbage. See this example output, when trying to get 30 topics:
Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...
Many of the topics end up with identical word weights, all equal to the `topic_word_prior` parameter (0.01 in my case) — i.e. they apparently never absorbed any word counts at all.
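Here's a quick way to quantify that (just a sketch with a toy components_ matrix; count_prior_only_topics is a helper name I made up, not anything from sklearn):

```python
import numpy as np

def count_prior_only_topics(components, topic_word_prior, tol=1e-8):
    # A topic is "stuck at the prior" if every word weight in its row
    # equals topic_word_prior, i.e. the topic never absorbed any counts.
    stuck = np.all(np.isclose(components, topic_word_prior, atol=tol), axis=1)
    return int(stuck.sum())

# toy components_ matrix: one real topic, one topic stuck at the prior
components = np.array([[1081.61, 866.01, 0.01],
                       [0.01, 0.01, 0.01]])
print(count_prior_only_topics(components, topic_word_prior=0.01))  # 1
```

On my real `lda.components_`, most of the 30 rows are stuck like this.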
This is my script:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)


data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                n_jobs=-1,  # all CPUs
                                verbose=1,
                                evaluate_every=10,
                                max_iter=1000,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=1)
lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)
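For context, here is a self-contained toy version of the same setup (purely a sketch; the random counts just stand in for the real tf matrix, which also has only 18 documents — one per Gutenberg file — against 30 requested topics):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Random counts standing in for the real document-term matrix: with more
# topics (30) than documents (18), each document can be dominated by only
# one topic, so at least 30 - 18 = 12 topics never dominate any document.
rng = np.random.RandomState(1)
X = rng.randint(0, 20, size=(18, 100))  # 18 "documents", 100 "terms"

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                random_state=1)
doc_topic = lda.fit_transform(X)  # shape: (18, 30)

used = set(doc_topic.argmax(axis=1))
print(len(used))  # at most 18 of the 30 topics can dominate a document
```

Maybe that pigeonhole effect explains some of the empty topics, but it doesn't obviously explain why nearly all of them come out as garbage here while other implementations do better.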
Is there a problem in how I set up the LatentDirichletAllocation instance or pass the data? I tried out different parameter settings, but none of them provided good results for that corpus. I also tried out alternative implementations (like the lda package [2]) and those were able to find reasonable topics.
Best,
Markus
[1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/