Categorising strings into random versus non-random
Steven D'Aprano
steve at pearwood.info
Sun Dec 20 22:01:48 EST 2015
I have a large number of strings (originally file names) which tend to fall
into two groups. Some are human-meaningful, but not necessarily dictionary
words e.g.:
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
(note that some use underscores, others spaces, and some CamelCase) while
others are completely meaningless (or mostly so):
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
Let's call the second group "random" and the first "non-random", without
getting bogged down in arguments about whether they are really random or
not. I wish to process the strings and automatically determine whether each
string is random or not. I need to split the strings into three groups:
- those that I'm confident are random
- those that I'm unsure about
- those that I'm confident are non-random
Ideally, I'll get some sort of numeric score so I can tweak where the
boundaries fall.
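(To illustrate the kind of thing I mean: given some numeric score per
string, the three-way split could be as simple as two tunable thresholds.
The function name and threshold values here are hypothetical, just a
sketch.)

```python
def bucket(score, low=0.3, high=0.7):
    """Classify a randomness score into one of three groups.

    `low` and `high` are placeholder thresholds to be tuned once
    the distribution of scores is known.
    """
    if score >= high:
        return "random"        # confident the string is random
    if score <= low:
        return "non-random"    # confident the string is meaningful
    return "unsure"            # in between: needs manual review
```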
Strings are *mostly* ASCII but may include a few non-ASCII characters.
Note that false positives (detecting a meaningful non-random string as
random) are worse for me than false negatives (miscategorising a random
string as non-random).
Does anyone have any suggestions for how to do this? Preferably something
already existing. I have some thoughts and/or questions:
- I think nltk has a "language detection" function, would that be suitable?
- If not nltk, are there other suitable language detection libraries?
- Is this the sort of problem that neural networks are good at solving?
Anyone know a really good tutorial for neural networks in Python?
- How about Bayesian filters, e.g. SpamBayes?
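(One more thought: a simple character-bigram model, trained on a corpus of
known-meaningful strings, might already give a usable score. Random
strings should contain many bigrams that rarely occur in meaningful text,
so their average per-bigram log-probability should be noticeably lower.
This is only a sketch; the smoothing constant and function names are made
up, not from any existing library.)

```python
import math
from collections import Counter

def train_bigrams(corpus_strings):
    """Count character bigrams over a corpus of meaningful strings."""
    counts = Counter()
    for s in corpus_strings:
        s = s.lower()
        for a, b in zip(s, s[1:]):
            counts[a + b] += 1
    return counts, sum(counts.values())

def avg_log_prob(s, counts, total, vocab=820):
    """Average per-bigram log-probability of a string under the model.

    Lower (more negative) values suggest a more 'random' string.
    `vocab` is a rough add-one smoothing denominator, not tuned.
    """
    s = s.lower()
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return 0.0
    logp = 0.0
    for a, b in pairs:
        # add-one smoothing so unseen bigrams don't zero out the score
        p = (counts[a + b] + 1) / (total + vocab)
        logp += math.log(p)
    return logp / len(pairs)
```

The resulting score could then be thresholded into the three groups, with
the cut-offs shifted to make "random" harder to reach, since false
positives cost me more.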
--
Steven
More information about the Python-list mailing list