Categorising strings into random versus non-random

Chris Angelico rosuav at gmail.com
Sun Dec 20 23:22:39 EST 2015


On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
> I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

The first thing that comes to my mind is poking the string into a
search engine and seeing how many results come back. You might need to
do some preprocessing to recognize multi-word forms (maybe a handful
of recognized cases like snake_case, CamelCase,
CamelCasewiththeLittleWordsLeftUnchanged, etc), but doing that
manually on the above text gives me:

* baby lions at play
* saturday morning 12
* fukushima
* impossible fork
* xy 39 mgwbosjy
* 9 sjz 7 s 8198 ghwt
* rz 4 sdko 28 dbrw 00 u
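
A rough sketch of how that manual splitting might be automated, using
only the standard `re` module. The function name `split_name` and the
exact rules are assumptions for illustration, not something from the
original post: separators become spaces, a capital letter followed by a
lowercase letter starts a new word, and letters are separated from
digits. On the seven sample strings these rules should reproduce the
list above, but forms like CamelCasewiththeLittleWordsLeftUnchanged
would still need a dictionary to split fully.

```python
import re


def split_name(name):
    """Split a file-name-like string into lowercase search terms.

    Hypothetical helper: the rules below are one possible choice.
    """
    # Underscores, hyphens and whitespace all act as word separators.
    text = re.sub(r"[_\-\s]+", " ", name)
    # CamelCase: break before a capital that begins a lowercase run,
    # so "ImpossibleFork" splits but "mGWbosjY" is left alone.
    text = re.sub(r"(?<=[a-z])(?=[A-Z][a-z])", " ", text)
    # Break wherever letters meet digits, in either direction.
    text = re.sub(
        r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", text
    )
    return text.lower().split()


if __name__ == "__main__":
    samples = [
        "baby lions at play",
        "saturday_morning12",
        "Fukushima",
        "ImpossibleFork",
        "xy39mGWbosjY",
        "9sjz7s8198ghwt",
        "rz4sdko-28dbRW00u",
    ]
    for sample in samples:
        print("*", " ".join(split_name(sample)))
```
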

Putting those into Google without quotes yields:

* About 23,800,000 results
* About 227,000,000 results
* About 32,500,000 results
* About 16,400,000 results
* About 1,180 results
* 7 results
* About 30,300 results

DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

* 6,040,000 results
* 123,000,000 results
* 3,920,000 results
* 720,000 results
* No results at all
* No results at all
* 2 results

Bing produces much more chaotic results, though:

* 34,000,000 RESULTS
* 15,600,000 RESULTS
* 11,000,000 RESULTS
* 1,620,000 RESULTS
* 5,720,000 RESULTS
* 1,580,000,000 RESULTS
* 3,380,000 RESULTS

This suggests that search engine results MAY be useful, but in some
cases, tweaks may be necessary (I couldn't force Bing to do phrase
search, for some reason probably related to my inexperience with it),
and also that the boundary between "meaningful" and "non-meaningful"
will depend on the engine used (I'd use 1,000,000 as the boundary with
Google, but probably 100,000 with Yahoo). You might want to handle
numerics differently, too - converting "9" into "nine" could improve
the result reliability.
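
To make the boundary idea concrete, here is a minimal sketch of turning
a result count into a numeric score and one of the three groups asked
for. The names `score` and `classify` are hypothetical. The upper
cutoff of 1,000,000 is the Google boundary suggested above; the lower
cutoff of 10,000 is purely an assumption, chosen only to show an
"unsure" band, and both would need tuning per engine.

```python
import math


def score(result_count):
    """Log-scaled score so that huge counts don't swamp small ones."""
    # Adding 1 keeps a count of zero well-defined (score 0.0).
    return math.log10(result_count + 1)


def classify(result_count, random_below=10_000, meaningful_above=1_000_000):
    """Sort a result count into one of three groups.

    random_below is an assumed cutoff; meaningful_above defaults to the
    boundary suggested for Google. Pass different values per engine.
    """
    if result_count >= meaningful_above:
        return "non-random"
    if result_count < random_below:
        return "random"
    return "unsure"


if __name__ == "__main__":
    # Google counts quoted earlier in this message, in the same order.
    google_counts = [
        23_800_000, 227_000_000, 32_500_000, 16_400_000,
        1_180, 7, 30_300,
    ]
    for count in google_counts:
        print(f"{count:>12,}  {score(count):5.2f}  {classify(count)}")
```

With Yahoo's numbers the same function could be called with
meaningful_above=100_000, in line with the lower boundary mentioned
above.
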

How many of these keywords would you be looking up, and would a
network transaction (a search engine API call) for each one be too
expensive?

ChrisA


