Categorising strings into random versus non-random

Mon Dec 21 05:56:05 EST 2015

On 21.12.15 at 11:53, Christian Gollwitzer wrote:
> So for the spaces, either use proper training material (some long
> corpus from Wikipedia or such), with punctuation removed. Then it will
> catch the correct probabilities at word boundaries. Or preprocess by
> removing the spaces.
>
>      Christian

PS: The real log-likelihood would become -infinity whenever some pair
does not appear at all in the training set (pairs containing digits,
for example). I used 1/total as the default value of the defaultdict to
mitigate that. You could tweak that value a bit. Also, the larger the
corpus, the more sharply the scores will separate random from
non-random strings on their own.
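
To illustrate the 1/total fallback, here is a minimal sketch. It is not the
exact code from earlier in the thread: the function names are made up for this
example, and I am assuming that "total" means the total number of character
pairs counted in the training text.

```python
import math
from collections import defaultdict


def train_bigram_model(corpus):
    """Estimate character-pair probabilities from a training string.

    Hypothetical helper for illustration; 'total' is assumed to be the
    total number of adjacent character pairs in the corpus.
    """
    counts = defaultdict(int)
    for a, b in zip(corpus, corpus[1:]):
        counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus must contain at least two characters")
    # Pairs never seen in training get 1/total instead of 0, so that
    # log() stays finite. This is the value you could tweak.
    floor = 1.0 / total
    probs = defaultdict(lambda: floor)
    for pair, n in counts.items():
        probs[pair] = n / total
    return probs


def avg_log_likelihood(text, probs):
    """Mean log-probability per character pair of `text` under `probs`."""
    pairs = [a + b for a, b in zip(text, text[1:])]
    if not pairs:
        raise ValueError("text must contain at least two characters")
    # Note: looking up an unseen pair stores it in the defaultdict.
    return sum(math.log(probs[p]) for p in pairs) / len(pairs)
```

With this fallback an unseen pair is scored like a pair that was seen exactly
once, so a string made only of unseen pairs gets a low but finite score
instead of -infinity.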

	Christian