Categorising strings into random versus non-random

Mon Dec 21 11:40:50 EST 2015

On 21/12/15 03:01, Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
> 
> 
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
> 
> 
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
> 
> 
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
> 
> 
> Let's call the second group "random" and the first "non-random", without
> getting bogged down in arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
> 
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
> 
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
> 
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
> 
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
> 
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
> 
> - I think nltk has a "language detection" function, would that be suitable?
> 
> - If not nltk, are there any suitable language detection libraries?
> 
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> 
> - How about Bayesian filters, e.g. SpamBayes?
> 
> 
> 
> 

Finite state machine / transition matrix. Learn the character transition
probabilities from some English text source. Then process your strings by
lowercasing, replacing underscores with spaces, removing trailing numeric
characters, etc. Base your score on something like the mean transition
probability. I'd expect to see two pretty well separated groups of scores.
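
A rough sketch of that idea, not a tested solution. The names (normalise,
train, mean_log_prob), the 37-symbol alphabet, the add-one smoothing and the
use of the mean *log* probability are all assumptions made for illustration,
and SAMPLE is only a stand-in for a proper English corpus, which would need to
be far larger for the scores to separate cleanly.

```python
import math
import re
from collections import defaultdict

# Assumed alphabet: lowercase letters, digits and space. Anything else
# (punctuation, non-ASCII characters) is mapped to a space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "


def normalise(s):
    """Lowercase, turn underscores into spaces, drop trailing digits."""
    s = s.lower().replace("_", " ").strip()
    s = re.sub(r"\d+$", "", s)
    s = "".join(c if c in ALPHABET else " " for c in s)
    return re.sub(r" +", " ", s).strip()


def train(text):
    """Count character-to-character transitions in a training text."""
    counts = defaultdict(lambda: defaultdict(int))
    text = normalise(text)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts


def mean_log_prob(s, counts):
    """Mean log transition probability of s, with add-one smoothing.

    Higher (closer to zero) means more like the training text.
    Returns None when the normalised string is too short to score.
    """
    s = normalise(s)
    if len(s) < 2:
        return None
    total = 0.0
    for a, b in zip(s, s[1:]):
        row = counts.get(a, {})
        p = (row.get(b, 0) + 1) / (sum(row.values()) + len(ALPHABET))
        total += math.log(p)
    return total / (len(s) - 1)


# Placeholder training text; substitute a large English corpus in practice.
SAMPLE = """
the baby lions were at play on saturday morning while the others slept
in the shade and it seemed impossible that anything could disturb them
"""

if __name__ == "__main__":
    model = train(SAMPLE)
    names = ["baby lions at play", "saturday_morning12", "Fukushima",
             "ImpossibleFork", "xy39mGWbosjY", "9sjz7s8198ghwt",
             "rz4sdko-28dbRW00u"]
    for name in names:
        print(f"{mean_log_prob(name, model)!r:>22}  {name}")
```

Two cut-offs on the score would then give the three groups (confident random,
unsure, confident non-random). Since a meaningful string wrongly flagged as
random is the worse error, the "random" cut-off would presumably be set
conservatively.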

Duncan