Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random)

Mon Dec 21 03:47:43 EST 2015

On Monday 21 December 2015 14:45, Ben Finney wrote:

> Steven D'Aprano <steve at pearwood.info> writes:
> 
>> Let's call the second group "random" and the first "non-random",
>> without getting bogged down into arguments about whether they are
>> really random or not.
> 
> I think we should discuss it, even at risk of getting bogged down. As
> you know better than I, “random” is not an observable property of the
> value, but of the process that produced it.
> 
> So, I don't think “random” is at all helpful as a descriptor of the
> criteria you need for discriminating these values.
> 
> Can you give a better definition of what criteria distinguish the
> values, based only on their observable properties?

No, not really. This *literally* is a case of "I'll know it when I see it", 
which suggests that some sort of machine-learning solution (neural network?) 
may be useful. I can train it on a bunch of strings which I can hand-
classify, and let the machine pick out the correlations, then apply it to 
the rest of the strings.

The best I can say is that the "non-random" strings either are, or consist 
of, mostly English words, names, or things which look like they might be 
English words, containing no more than a few non-ASCII characters, 
punctuation, or digits.
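
As a rough illustration of that description, here is a minimal Python sketch 
that measures those surface properties. The names surface_features and 
looks_wordlike, and the 0.2 threshold, are illustrative assumptions and have 
not been tuned on real data.

```python
import string

def surface_features(s):
    """Return the fraction of characters in s that fall into each class."""
    n = len(s) or 1  # avoid dividing by zero for the empty string
    return {
        "ascii_letters": sum(c in string.ascii_letters for c in s) / n,
        "digits": sum(c in string.digits for c in s) / n,
        "punctuation": sum(c in string.punctuation for c in s) / n,
        "non_ascii": sum(ord(c) > 127 for c in s) / n,
    }

def looks_wordlike(s, max_other=0.2):
    """Guess whether s is mostly ASCII letters (max_other is an assumed cutoff)."""
    f = surface_features(s)
    other = f["digits"] + f["punctuation"] + f["non_ascii"]
    return f["ascii_letters"] > 0 and other <= max_other

for candidate in ["python", "Ben Finney", "x9$k2#q7", "日本語"]:
    print(candidate, looks_wordlike(candidate))
```

This only looks at character classes, so a jumbled word made entirely of 
ASCII letters passes just as easily as a real one.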

> You used “meaningless”; that seems at least more hopeful as a criterion
> we can use by examining text values. So, what counts as meaningless?

Strings made up of random-looking sequences of characters, like you often 
see on sites like imgur or tumblr. Characters from non-Latin character sets 
that I can't read (e.g. Japanese, Korean, Arabic). Jumbled up words, 
e.g. "python" is non-random, "nyohtp" would be random.
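
One way to catch jumbled words is to score how English-like the adjacent 
letter pairs are. The sketch below is a character bigram model with add-one 
smoothing, offered as one possible approach rather than a recommendation. 
TRAINING_WORDS is a tiny made-up stand-in for a real word list, and 
vocab_size=28 (26 letters plus two boundary markers) is likewise an 
assumption.

```python
import math
from collections import Counter

# Hypothetical stand-in for a real English word list.
TRAINING_WORDS = ["the", "then", "this", "thing", "north", "phone", "honey",
                  "tone", "stone", "typhoon", "path", "on", "one", "not",
                  "hot", "type"]

def train_bigrams(words):
    """Count adjacent character pairs, with ^ and $ marking word boundaries."""
    pair_counts = Counter()
    first_counts = Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[a, b] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def mean_log_prob(s, pair_counts, first_counts, vocab_size=28):
    """Average log-probability per bigram, using add-one smoothing."""
    padded = "^" + s.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((pair_counts[a, b] + 1) / (first_counts[a] + vocab_size))
        for a, b in pairs
    )
    return total / len(pairs)

pairs, firsts = train_bigrams(TRAINING_WORDS)
for candidate in ["python", "nyohtp"]:
    print(candidate, mean_log_prob(candidate, pairs, firsts))
```

With this word list "python" scores higher than "nyohtp", because pairs 
such as "th" and "on" occur in the training words while "ht" and "tp" do 
not. Where to put the cutoff between the two groups would still have to 
come from hand-classified examples.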

[...]
> Perhaps you could measure Shannon entropy (“expected information value”)
> <URL:https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as
> a proxy? Or maybe I don't quite understand the criteria.

That's a possibility. At least, it might be able to distinguish some 
strings, although if I understand correctly, the two strings "python" and 
"nhoypt" have identical entropy when it is computed from character 
frequencies (they are anagrams), so this alone won't be sufficient.
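
A quick check of that claim, as a minimal sketch: this computes 
per-character Shannon entropy from character frequencies only, which is an 
assumption about what is being measured. The function name shannon_entropy 
is illustrative.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy in bits per character, from character counts alone."""
    if not s:
        return 0.0
    n = len(s)
    # Sort the counts so anagrams are summed in the same order.
    counts = sorted(Counter(s).values())
    return -sum((c / n) * math.log2(c / n) for c in counts)

print(shannon_entropy("python"))
print(shannon_entropy("nhoypt"))
# Anagrams have the same character counts, so the two values are equal.
print(shannon_entropy("python") == shannon_entropy("nhoypt"))
```

Because the measure depends only on how often each character occurs and 
not on the order, it cannot separate a word from its anagrams.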

-- 
Steve