Categorising strings into random versus non-random

Vlastimil Brom vlastimil.brom at gmail.com
Mon Dec 21 08:25:28 EST 2015


2015-12-21 4:01 GMT+01:00 Steven D'Aprano <steve at pearwood.info>:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be suitable?
>
> - If not nltk, are there are suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
>
>
>
>
> --
> Steven
>

Hi,
as you probably already know, NLTK could be helpful for parts of this
task; if you can handle the likely "word" splitting implied by the
underscores, CamelCase etc., you could try to tag the parts of speech
of the resulting words and interpret the results according to your
needs.
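
A rough sketch of such a tokenizer follows; the split_tokens helper is
a hypothetical name, and the regexes are just one possible way to break
on underscores, spaces, CamelCase and letter-digit boundaries:

import re

def split_tokens(name):
    # Split on underscores, spaces and hyphens first ...
    parts = re.split(r'[_\s\-]+', name)
    tokens = []
    for part in parts:
        # ... then insert breaks at lower->Upper and letter<->digit
        # boundaries and split on the inserted spaces.
        part = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', part)
        part = re.sub(r'(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])', ' ', part)
        tokens.extend(part.split())
    return tokens

print(split_tokens("saturday_morning12"))  # ['saturday', 'morning', '12']
print(split_tokens("ImpossibleFork"))      # ['Impossible', 'Fork']
print(split_tokens("9sjz7s8198ghwt"))      # ['9', 'sjz', '7', 's', '8198', 'ghwt']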
In the online demo
http://text-processing.com/demo/tag/
your sample (with different approaches to splitting the words) yields:

baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy39mGWbosjY/-None-
9sjz7s8198ghwt/-None- rz4sdko/-None- -/: 28dbRW00u/-None-

or, with additional splitting on case or letter-digit boundaries:
baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy/-None- 39/CD m/-None- G/NNP
Wbosj/-None- Y/-None- 9/CD sjz/-None- 7/CD s/-None- 8198/-NONE-
ghwt/-None- rz/-None- 4/CD sdko/-None- -/: 28/CD db/-None- R/NNP
W/-None- 00/-None- u/-None-

The tagset seems to be compatible with the Penn Treebank POS tags:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

There is sample code producing output comparable to this demo:
http://stackoverflow.com/questions/23953709/how-do-i-tag-a-sentence-with-the-brown-or-conll2000-tagger-chunker
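
A minimal sketch in that spirit, assuming a plain UnigramTagger trained
on the Brown corpus (the demo's actual tagger is more elaborate, and the
Brown tagset differs from the Penn Treebank tags shown above); tokens
never seen in training come back as None, roughly like the "-None-"
entries above:

import nltk
# nltk.download('brown')  # needed once for the training data

tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents(categories='news'))

print(tagger.tag(['baby', 'lions', 'at', 'play']))
print(tagger.tag(['xy39mGWbosjY']))  # [('xy39mGWbosjY', None)]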

For the given minimal sample, the demo's results look useful (with the
possible exception of capitalised words sometimes being tagged as proper
nouns, which might not be that relevant here).
Of course, this approach doesn't directly give you a score, but you
could compute the proportion of recognised "words" relative to the total
number of "words" in the respective filename; see the sketch after this
paragraph. Training the tagger yourself should also be possible in NLTK,
but I don't have experience with that.
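
A possible scoring sketch along those lines, reusing the hypothetical
split_tokens() and tagger from the sketches above (higher values suggest
a non-random name, lower values a random one):

def recognised_fraction(name):
    # Fraction of tokens the tagger recognises, between 0.0 and 1.0.
    tokens = split_tokens(name)
    if not tokens:
        return 0.0
    known = sum(1 for _, tag in tagger.tag(tokens) if tag is not None)
    return float(known) / len(tokens)

for name in ["baby lions at play", "saturday_morning12",
             "xy39mGWbosjY", "9sjz7s8198ghwt"]:
    print(name, recognised_fraction(name))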

regards,
     vbr


