Categorising strings into random versus non-random
Christian Gollwitzer
auriocus at gmx.de
Mon Dec 21 05:56:05 EST 2015
On 21.12.15 at 11:53, Christian Gollwitzer wrote:
> So for the spaces, either use proper training material (some long
> corpus from Wikipedia or such), with punctuation removed. Then it will
> catch the correct probabilities at word boundaries. Or preprocess by
> removing the spaces.
>
> Christian
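A minimal sketch of the two preprocessing options quoted above, in case
it helps. The helper names strip_punctuation and strip_spaces are made
up for illustration, and "punctuation" is assumed to mean ASCII
punctuation only:

```python
import string


def strip_punctuation(text):
    """Drop ASCII punctuation but keep the spaces (hypothetical helper).

    Meant for cleaning a training corpus so that pairs at word
    boundaries are still counted.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


def strip_spaces(text):
    """Remove all whitespace (hypothetical helper).

    Meant for the alternative: preprocess the strings so that no
    pair ever contains a space.
    """
    return "".join(text.split())
```

Whichever option you pick, it probably makes sense to treat the
training corpus and the strings to be classified the same way.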
PS: The real log-likelihood would become -infinity whenever some pair
does not appear at all in the training set (pairs containing digits,
for example, are likely to be missing). I used 1/total as the default
value in the defaultdict to mitigate that. You could tweak that value a
bit. The larger the corpus, the more sharply it will separate the two
classes on its own, too.
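
To illustrate the PS, here is a rough sketch of a character-pair model
with a 1/total fallback for unseen pairs. It is not necessarily
identical to the code from the earlier post; the function names
train_bigram_model and mean_log_likelihood are made up for this
example, and averaging over the number of pairs is my own assumption
so that strings of different lengths can be compared:

```python
import math
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Map each adjacent character pair in corpus to its log-probability."""
    pairs = Counter(zip(corpus, corpus[1:]))
    total = sum(pairs.values())
    # Unseen pairs would have probability 0, i.e. log-likelihood -infinity.
    # Fall back to 1/total instead so that the sum stays finite.
    # This floor is the value you could tweak.
    floor = math.log(1.0 / total)
    model = defaultdict(lambda: floor)
    for pair, count in pairs.items():
        model[pair] = math.log(count / total)
    return model


def mean_log_likelihood(model, text):
    """Average log-probability per character pair of text."""
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0  # assumption: strings shorter than 2 chars score 0
    return sum(model[p] for p in pairs) / len(pairs)
```

Usage would be something like
model = train_bigram_model(corpus), then comparing
mean_log_likelihood(model, s) for each string s against a threshold
that you pick by looking at a few known examples.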
Christian