Categorising strings into random versus non-random
Peter Otten
__peter__ at web.de
Mon Dec 21 03:24:24 EST 2015
Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to
> fall into two groups. Some are human-meaningful, but not necessarily
> dictionary words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down in arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether
> each string is random or not. I need to split the strings into three
> groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be
> suitable?
>
> - If not nltk, are there are suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
A dead simple approach: collect the character pairs (bigrams) that occur in
real words, then score each string by the ratio

    pairs-also-found-in-real-words / num-pairs
$ cat score.py
import sys

WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()


def extract_pairs(text):
    for i in range(len(text)-1):
        yield text[i:i+2]


def load_pairs():
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs


def get_score(text, popular_pairs):
    m = 0
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    return m/i


def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f %s" % (score, text))


if __name__ == "__main__":
    main()
$ python3 score.py
0.65 baby lions at play
0.76 saturday_morning12
1.00 Fukushima
0.92 ImpossibleFork
0.36 xy39mGWbosjY
0.31 9sjz7s8198ghwt
0.31 rz4sdko-28dbRW00u
However:
$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65 bnsip atl ayba loy
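The shuffle slips through because set membership ignores how common each pair
is. A possible refinement (my sketch, not part of the script above): count
bigram frequencies in the word list and score a string by its average
log-count per bigram, so strings built mostly from rare pairs score lower.
The tiny corpus and the floor value for unseen pairs below are arbitrary
stand-ins; a real run would train on /usr/share/dict/words.

```python
import math
from collections import Counter


def bigrams(text):
    """Yield overlapping character pairs from text."""
    for i in range(len(text) - 1):
        yield text[i:i + 2]


def train(words):
    """Count how often each bigram occurs in the word list."""
    counts = Counter()
    for word in words:
        counts.update(bigrams(word.lower()))
    return counts


def avg_log_freq(text, counts, floor=0.5):
    """Average log-count per bigram; unseen pairs get a small
    floor so a single odd pair doesn't dominate the score."""
    pairs = list(bigrams(text.lower()))
    if not pairs:
        return 0.0
    return sum(math.log(counts.get(p, floor)) for p in pairs) / len(pairs)


# Tiny stand-in corpus instead of /usr/share/dict/words.
corpus = "baby lions play saturday morning impossible fork".split()
model = train(corpus)

meaningful = avg_log_freq("baby lions at play", model)
shuffled = avg_log_freq("bnsip atl ayba loy", model)
print("%.2f  baby lions at play" % meaningful)
print("%.2f  bnsip atl ayba loy" % shuffled)
```

On this toy corpus the shuffled string does score lower than the original,
but whether the separation holds up against a full dictionary would need
checking.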