[scikit-learn] Using perplexity from LatentDirichletAllocation for cross validation of Topic Models

Markus Konrad markus.konrad at wzb.eu
Wed Oct 4 07:35:31 EDT 2017


Hi there,

I'm trying to find the optimal number of topics for Topic Modeling with
Latent Dirichlet Allocation (LDA). I implemented a 5-fold cross
validation procedure similar to the one described and implemented in R
in [1]: I split the full data into 5 equal-sized chunks; then for each
fold (`cur_fold`), 4 of the 5 chunks are used for training and 1 for
validation, using the `perplexity()` method on the held-out data:

```
from sklearn.decomposition import LatentDirichletAllocation

# boolean masks select 4 of the 5 chunks for training, 1 for validation
dtm_train = data[split_folds != cur_fold, :]
dtm_valid = data[split_folds == cur_fold, :]

lda_instance = LatentDirichletAllocation(**params)
lda_instance.fit(dtm_train)

# perplexity on the held-out fold
perpl = lda_instance.perplexity(dtm_valid)
```

This is repeated for a grid of parameters, primarily for a varying
number of topics (`n_components`).
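
For reference, here is a minimal sketch of the full loop, assuming
`data` is a document-term matrix (e.g. from `CountVectorizer`) and
`split_folds` is an array assigning one of 5 fold labels to each
document row; the grid of topic numbers is just illustrative:

```
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_folds = 5
mean_perplexities = {}   # mean held-out perplexity per number of topics

for n_topics in (10, 20, 50, 100, 200):   # illustrative grid
    fold_perplexities = []
    for cur_fold in range(n_folds):
        dtm_train = data[split_folds != cur_fold, :]
        dtm_valid = data[split_folds == cur_fold, :]

        lda = LatentDirichletAllocation(n_components=n_topics,
                                        learning_method='batch')
        lda.fit(dtm_train)
        fold_perplexities.append(lda.perplexity(dtm_valid))

    mean_perplexities[n_topics] = np.mean(fold_perplexities)
```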

I tried this out with a number of different data sets, for example with
the "Associated Press" (AP) data mentioned in [1], which is the sample
data for David M. Blei's LDA-C implementation [2]. Using the same data,
I would expect to get results similar to those in [1], which found that
a model with ~100 topics fits the AP data best. However, my experiments
always show that the perplexity grows exponentially with the number of
topics, so the "best" model is always the one with the lowest number of
topics. The same happens with other data sets, too. I get similar
results when calculating the perplexity on the full training data alone
(i.e., without cross validation on held-out data).
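
For reference, the scikit-learn docs define perplexity as
exp(-1. * log-likelihood per word), and `score()` returns the
approximate (variational bound) log-likelihood, so if I understand the
implementation correctly, the following should roughly reproduce the
`perplexity()` value:

```
import numpy as np

# sanity check: should roughly match lda_instance.perplexity(dtm_valid),
# assuming perplexity = exp(-1 * log-likelihood / total word count)
perpl_manual = np.exp(-lda_instance.score(dtm_valid) / dtm_valid.sum())
```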

Does anyone have an idea why these results are not consistent with
those from [1]? Is `perplexity()` not the right method to use when
evaluating held-out data? Could it be a problem that some columns of
the training data's term-frequency matrix are all zero?
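
For reference, a quick way to count such columns, assuming `dtm_train`
is a NumPy array or a SciPy sparse matrix:

```
import numpy as np

# number of vocabulary terms that never occur in the training folds;
# np.asarray(...).ravel() also handles the matrix returned by sparse .sum()
n_zero_cols = int(np.sum(np.asarray(dtm_train.sum(axis=0)).ravel() == 0))
print('all-zero columns in training DTM:', n_zero_cols)
```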

Best,
Markus


[1] http://ellisp.github.io/blog/2017/01/05/topic-model-cv
[2] https://web.archive.org/web/20160930175144/http://www.cs.princeton.edu/~blei/lda-c/index.html


