Categorising strings into random versus non-random

Christian Gollwitzer auriocus at gmx.de
Mon Dec 21 05:53:12 EST 2015


On 21.12.15 at 11:36, Steven D'Aprano wrote:
> On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
>
>> Apfelkiste:Tests chris$ python score_my.py
>> -8.74  baby lions at play
>> -7.63  saturday_morning12
>> -6.38  Fukushima
>> -5.72  ImpossibleFork
>> -10.6  xy39mGWbosjY
>> -12.9  9sjz7s8198ghwt
>> -12.1  rz4sdko-28dbRW00u
>> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
>> -9.43  bnsip atl ayba loy
>
> Thanks Christian and Peter for the suggestion, I'll certainly investigate
> this further.
>
> But the scoring doesn't seem very good. "baby lions at play" is 100% English
> words, and ought to have a radically different score from (say)
> xy39mGWbosjY which is extremely non-English like. (How many English words
> do you know of with W, X, two Y, and J?) And yet they are only two units
> apart. "baby lions..." is a score almost as negative as the authentic
> gibberish, while Fukushima (a Japanese word) has a much less negative
> score.

It is the spaces, which do not occur in the training wordlist (I
mentioned that above, maybe not prominently enough):
/usr/share/dict/words contains one word per line, so the model never
sees a space. The underscore _ is probably what pushes the score of
saturday_morning12 down, just as the spaces push down "baby lions at
play". Here are the scores using trigraphs:


Apfelkiste:Tests chris$ python score_my.py
-11.5  baby lions at play
-9.88  saturday_morning12
-9.85  Fukushima
-7.68  ImpossibleFork
-13.4  xy39mGWbosjY
-14.2  9sjz7s8198ghwt
-14.2  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74  babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93  saturdaymorning12
Apfelkiste:Tests chris$
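
For anyone who wants to experiment, a minimal sketch of a trigraph scorer might look like the following. This is not score_my.py, which is not shown in this message. The names ngrams, train and score, the boundary markers ^ and $, the add-one smoothing, and the per-trigraph averaging are all assumptions made for illustration, so the absolute numbers will not match the ones above.

```python
import math
from collections import Counter

def ngrams(word, n=3):
    """Return the character n-grams of a word (hypothetical helper).

    The word is lowercased and padded with '^' at the start and '$' at
    the end so that word boundaries get their own n-grams.
    """
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(words, n=3):
    """Count n-grams over an iterable of training words."""
    counts = Counter()
    for w in words:
        counts.update(ngrams(w, n))
    return counts

def score(text, counts, n=3):
    """Average log-probability per n-gram, with add-one smoothing.

    Unseen n-grams get a small non-zero probability instead of
    log(0). Higher (less negative) means "more like the training data".
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen n-grams
    grams = ngrams(text, n)
    logp = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return logp / len(grams)

if __name__ == "__main__":
    # Tiny stand-in for a real wordlist such as /usr/share/dict/words.
    words = ["baby", "lions", "play", "saturday", "morning"]
    counts = train(words)
    for s in ["baby lions", "babylions", "xqzj"]:
        print("%6.3g  %s" % (score(s, counts), s))
```

With a one-word-per-line training list, every trigraph that contains a space is unseen, which is why the spaced string scores lower than the same string with the spaces removed.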

So to handle the spaces, either train on proper material (some long
corpus from Wikipedia or such, with punctuation removed), so that the
model picks up the correct probabilities at word boundaries, or
preprocess the input by removing the spaces.
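
Both options can be sketched in a few lines. The function names strip_separators and clean_corpus are hypothetical, and treating the underscore as a separator is an assumption based on the saturdaymorning12 example above.

```python
import re

def strip_separators(text):
    """Option 2: remove whitespace and underscores before scoring."""
    return re.sub(r"[\s_]+", "", text)

def clean_corpus(raw):
    """Option 1: prepare running text for training.

    Lowercases, drops everything except ASCII letters and whitespace,
    and collapses runs of whitespace to single spaces, so that the
    spaces between words survive and can be learned.
    """
    letters_only = re.sub(r"[^a-z\s]+", "", raw.lower())
    return " ".join(letters_only.split())

if __name__ == "__main__":
    print(strip_separators("baby lions at play"))
    print(strip_separators("saturday_morning12"))
    print(clean_corpus("Baby lions, at play!"))
```

The first function is the cheaper fix when the wordlist stays as it is; the second only helps if the model is retrained on the cleaned text.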

	Christian


