Categorising strings on meaningful–meaningless spectrum (was: Categorising strings into random versus non-random)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Dec 21 03:47:43 EST 2015
On Monday 21 December 2015 14:45, Ben Finney wrote:
> Steven D'Aprano <steve at pearwood.info> writes:
>
>> Let's call the second group "random" and the first "non-random",
>> without getting bogged down into arguments about whether they are
>> really random or not.
>
> I think we should discuss it, even at risk of getting bogged down. As
> you know better than I, “random” is not an observable property of the
> value, but of the process that produced it.
>
> So, I don't think “random” is at all helpful as a descriptor of the
> criteria you need for discriminating these values.
>
> Can you give a better definition of what criteria distinguish the
> values, based only on their observable properties?
No, not really. This *literally* is a case of "I'll know it when I see it",
which suggests that some sort of machine-learning solution (neural network?)
may be useful. I can train it on a bunch of strings which I can hand-
classify, and let the machine pick out the correlations, then apply it to
the rest of the strings.
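
As a rough illustration of the train-on-hand-classified-strings idea, here is a minimal sketch. It is not a neural network; it swaps in a much simpler character-bigram model with add-one smoothing, one model per class. The names BigramModel and classify, the alphabet_size guess, and the tiny training lists are all hypothetical, made up for the example.

```python
import math
from collections import Counter

def bigrams(s):
    """Return the character bigrams of s, with start (^) and end ($) markers."""
    padded = "^" + s.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

class BigramModel:
    """Character-bigram model with add-one smoothing (hypothetical helper)."""

    def __init__(self, examples, alphabet_size=64):
        self.counts = Counter()    # how often each bigram was seen
        self.context = Counter()   # how often each first character was seen
        self.v = alphabet_size     # assumed alphabet size, used for smoothing
        for s in examples:
            for bg in bigrams(s):
                self.counts[bg] += 1
                self.context[bg[0]] += 1

    def log_prob(self, s):
        """Average smoothed log-probability per bigram of s."""
        bgs = bigrams(s)
        total = sum(
            math.log((self.counts[bg] + 1) / (self.context[bg[0]] + self.v))
            for bg in bgs
        )
        return total / len(bgs)

def classify(s, meaningful, meaningless):
    """Label s by whichever model gives it the higher average log-probability."""
    if meaningful.log_prob(s) >= meaningless.log_prob(s):
        return "non-random"
    return "random"

# Hand-classified training strings (invented for this sketch).
meaningful = BigramModel(["python", "monty", "spam", "holy grail"])
meaningless = BigramModel(["xkqzvj", "qwzxvb", "zzxqkj"])

for candidate in ["python", "xkqzvj", "nyohtp"]:
    print(candidate, classify(candidate, meaningful, meaningless))
```

With training sets this small the labels for unseen strings mean very little; a realistic run would need a few hundred hand-classified strings per class at least.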
The best I can say is that the "non-random" strings either are, or consist
of, mostly English words, names, or things which look like they might be
English words, containing no more than a few non-ASCII characters,
punctuation, or digits.
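
The second half of that description (few non-ASCII characters, punctuation, or digits) can be sketched as a cheap pre-filter. This is only an assumption about how one might encode it: the name looks_non_random, the threshold of 3 for "a few", and the decision to ignore whitespace are all made up here. It does nothing about the "looks like English words" half.

```python
import string

def looks_non_random(s, max_other=3):
    """True if s has at most max_other characters that are not ASCII
    letters or whitespace (digits, punctuation, non-ASCII all count)."""
    other = sum(
        1 for ch in s
        if not (ch in string.ascii_letters or ch.isspace())
    )
    return other <= max_other

for candidate in ["python", "hello, world!", "a1b2c3d4", "日本語のテキスト"]:
    print(candidate, looks_non_random(candidate))
```

Note that a jumbled word such as "nyohtp" passes this filter, so it can only rule strings out, not confirm that they are meaningful.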
> You used “meaningless”; that seems at least more hopeful as a criterion
> we can use by examining text values. So, what counts as meaningless?
Strings made up of random-looking sequences of characters, like you often
see on sites like imgur or tumblr. Characters from non-Latin character sets
that I can't read (e.g. Japanese, Korean, Arabic). Jumbled-up words,
e.g. "python" is non-random, "nyohtp" would be random.
[...]
> Perhaps you could measure Shannon entropy (“expected information value”)
> <URL:https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as
> a proxy? Or maybe I don't quite understand the criteria.
That's a possibility. At least, it might be able to distinguish some
strings, although if I understand correctly, entropy computed from character
frequencies ignores ordering, so the anagrams "python" and "nhoypt" have
identical entropy and this alone won't be sufficient.
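
A quick way to check the point about anagrams is the sketch below. It assumes "entropy" means per-character Shannon entropy estimated from the character frequencies of the string itself; shannon_entropy is a name made up for the example.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Per-character Shannon entropy of s in bits, using only the
    frequency of each character (ordering is ignored)."""
    if not s:
        return 0.0
    n = len(s)
    # Sort the counts so anagrams are summed in exactly the same order.
    counts = sorted(Counter(s).values())
    return -sum((c / n) * math.log2(c / n) for c in counts)

print(shannon_entropy("python"))
print(shannon_entropy("nhoypt"))
print(shannon_entropy("python") == shannon_entropy("nhoypt"))
```

Any measure built purely on character counts has this limitation, which is why something order-sensitive (bigrams, dictionary lookups, a trained model) would be needed on top of it.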
--
Steve