Categorising strings on a meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random)

Sun Dec 20 22:45:31 EST 2015

Steven D'Aprano <steve at pearwood.info> writes:

> Let's call the second group "random" and the first "non-random",
> without getting bogged down into arguments about whether they are
> really random or not.

I think we should discuss it, even at risk of getting bogged down. As
you know better than I, “random” is not an observable property of the
value, but of the process that produced it.

So, I don't think “random” is at all helpful as a descriptor of the
criteria you need for discriminating these values.

Can you give a better definition of what criteria distinguish the
values, based only on their observable properties?

You used “meaningless”; that seems at least more hopeful as a criterion
we can use by examining text values. So, what counts as meaningless?

> I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

Perhaps you could measure the Shannon entropy (“expected information
value”) of each string's character distribution
<URL:https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as
a proxy? Or maybe I don't quite understand the criteria.
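
To make that suggestion concrete, here is a rough sketch of what I have
in mind. The function names and the two threshold values are my own
inventions for illustration, not anything from your description; you
would need to tune the boundaries against your real data.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Entropy of the character frequencies in text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def classify(text, low=2.5, high=3.5):
    """Sort a string into one of three groups by its entropy score.

    The `low` and `high` boundaries are arbitrary placeholders (an
    assumption for this sketch); adjust them to move the boundaries.
    """
    score = shannon_entropy(text)
    if score >= high:
        return "random"
    if score <= low:
        return "non-random"
    return "unsure"
```

Note the limits of this measure: it looks only at how often each
character occurs, so it scores a word and any anagram of that word
identically, and on short strings it mostly reflects how many distinct
characters there are. It gives you a numeric score to tweak, but it may
well not capture what you mean by “meaningless”.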

-- 
 \      “Actually I made up the term “object-oriented”, and I can tell |
  `\            you I did not have C++ in mind.” —Alan Kay, creator of |
_o__)                                        Smalltalk, at OOPSLA 1997 |
Ben Finney