Categorising strings into random versus non-random

Christian Gollwitzer auriocus at gmx.de
Mon Dec 21 04:56:24 EST 2015


On 21.12.15 at 09:24, Peter Otten wrote:
> Steven D'Aprano wrote:
>
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
>>
>>
>> baby lions at play
>> saturday_morning12
>> Fukushima
>> ImpossibleFork
>>
>>
>> (note that some use underscores, others spaces, and some CamelCase) while
>> others are completely meaningless (or mostly so):
>>
>>
>> xy39mGWbosjY
>> 9sjz7s8198ghwt
>> rz4sdko-28dbRW00u
>>
>>
>> Let's call the second group "random" and the first "non-random", without
>> getting bogged down into arguments about whether they are really random or
>> not. I wish to process the strings and automatically determine whether
>> each string is random or not. I need to split the strings into three
>> groups:
>>
>> - those that I'm confident are random
>> - those that I'm unsure about
>> - those that I'm confident are non-random
>>
>> Ideally, I'll get some sort of numeric score so I can tweak where the
>> boundaries fall.
>>
>> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>>
>> Note that false positives (detecting a meaningful non-random string as
>> random) are worse for me than false negatives (miscategorising a random
>> string as non-random).
>>
>> Does anyone have any suggestions for how to do this? Preferably something
>> already existing. I have some thoughts and/or questions:
>>
>> - I think nltk has a "language detection" function, would that be
>> suitable?
>>
>> - If not nltk, are there any suitable language detection libraries?
>>
>> - Is this the sort of problem that neural networks are good at solving?
>> Anyone know a really good tutorial for neural networks in Python?
>>
>> - How about Bayesian filters, e.g. SpamBayes?
>
> A dead simple approach -- look at the pairs in real words and calculate the
> ratio
>
> pairs-also-found-in-real-words/num-pairs

Sounds reasonable. Building on this approach, three simple improvements:
- calculate the log-likelihood instead, which also makes use of the 
frequency of the digraphs in the training set
- use trigraphs instead of digraphs
- preprocess the string (lowercase it); more sophisticated 
preprocessing could also be an option (e.g. converting under_scores and 
CamelCase to spaces) - see the sketch after this list
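
A rough idea of what that extra preprocessing could look like (the 
regular expressions here are just one possible choice, added for 
illustration; they are not part of the script further down):

import re

def normalize(text):
    # Illustrative preprocessing: insert a space at CamelCase boundaries,
    # turn underscores/hyphens into spaces, then lowercase.
    text = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', text)
    text = re.sub(r'[_\-]+', ' ', text)
    return text.lower()

# normalize("ImpossibleFork")     -> "impossible fork"
# normalize("saturday_morning12") -> "saturday morning12"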

The main reason for the low score of the baby lions is the space 
character, I think - the word list does not contain that many spaces. 
Maybe one should feed in some long Wikipedia article to calculate the 
digraph/trigraph probabilities.
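
For reference, the plain digraph ratio Peter suggested 
(pairs-also-found-in-real-words/num-pairs) only takes a few lines; this 
is a minimal sketch using the same word list, added here for comparison 
- it is not the script that produced the scores below:

from __future__ import division

WORDLIST = "/usr/share/dict/words"

def word_digraphs(path=WORDLIST):
    # Collect every digraph that occurs in the word list.
    known = set()
    with open(path) as f:
        for line in f:
            word = line.strip().lower()
            known.update(word[i:i+2] for i in range(len(word) - 1))
    return known

def ratio_score(text, known):
    # Fraction of the string's digraphs that also occur in real words.
    pairs = [text.lower()[i:i+2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    return sum(p in known for p in pairs) / len(pairs)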

=====================================
Apfelkiste:Tests chris$ cat score_my.py
from __future__ import division
from collections import Counter, defaultdict
from math import log
import sys
WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    # Slide a window of length 2 over the lowercased text.
    text = text.lower()
    for i in range(len(text)-1):
        yield text[i:i+2]
    # or range(len(text)-2) and text[i:i+3] for trigraphs


def load_pairs():
    # Count digraph frequencies over the whole word list.
    pairs = Counter()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    # normalize to a probability distribution with add-one smoothing,
    # so digraphs never seen in the word list get 1/N instead of zero
    total_count = sum(pairs.values())
    N = total_count + len(pairs)
    dist = defaultdict(lambda: 1/N, ((x, (pairs[x]+1)/N) for x in pairs))
    return dist


def get_score(text, dist):
    # average log-likelihood per digraph, so strings of different
    # lengths stay comparable
    ll = 0
    for i, x in enumerate(extract_pairs(text), 1):
        ll += log(dist[x])
    return ll / i


def main():
    pair_dist = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, pair_dist)
        print("%.3g  %s" % (score, text))


if __name__ == "__main__":
    main()

Apfelkiste:Tests chris$ python score_my.py
-8.74  baby lions at play
-7.63  saturday_morning12
-6.38  Fukushima
-5.72  ImpossibleFork
-10.6  xy39mGWbosjY
-12.9  9sjz7s8198ghwt
-12.1  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43  bnsip atl ayba loy
Apfelkiste:Tests chris$

and using trigraphs:

Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-12.5  bnsip atl ayba loy
Apfelkiste:Tests chris$ python score_my.py
-11.5  baby lions at play
-9.88  saturday_morning12
-9.85  Fukushima
-7.68  ImpossibleFork
-13.4  xy39mGWbosjY
-14.2  9sjz7s8198ghwt
-14.2  rz4sdko-28dbRW00u
==============================
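
Turning the scores into the three buckets the original post asks for is 
then just a matter of two thresholds. The cut-offs below are eyeballed 
from the digraph run above and would certainly need tuning on real data:

# Hypothetical three-way split on the average log-likelihood score;
# the cut-offs are eyeballed from the digraph scores above, not tuned.
RANDOM_BELOW = -10.0     # below this: confident "random"
NONRANDOM_ABOVE = -9.0   # above this: confident "non-random"

def classify(score, low=RANDOM_BELOW, high=NONRANDOM_ABOVE):
    if score < low:
        return "random"
    if score > high:
        return "non-random"
    return "unsure"

# classify(-12.9) -> "random"      classify(-8.74) -> "non-random"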
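
And if one wanted to train on running text (say, a saved Wikipedia 
article) instead of the word list, load_pairs() could be adapted roughly 
like this - the file name is only a placeholder, and extract_pairs() is 
the one from score_my.py:

from __future__ import division
from collections import Counter, defaultdict

def load_pairs_from_text(path="article.txt"):
    # Same as load_pairs(), but counts digraphs over arbitrary running
    # text; "article.txt" is a placeholder for any plain-text file.
    pairs = Counter()
    with open(path) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    # Add-one smoothing as before: unseen digraphs get probability 1/N.
    N = sum(pairs.values()) + len(pairs)
    return defaultdict(lambda: 1/N, ((x, (pairs[x] + 1)/N) for x in pairs))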



