Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Sep 5 12:24:27 EDT 2012


On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
[...]
>> You are making unjustified assumptions about the distribution of
>> letters in the words. This might be a list of long chemical compounds
>> where the words typically differ only in their suffix. It might be a
>> list of people with titles:
> 
> Actually, I'm not. I'm stating exactly what assumptions I'm making to
> get my calculation. I'm comparing *random* character strings or
> bitstrings.

Excuse me, you are not. You are comparing English words, which are highly 
non-random.


> You, on the other hand, are making vague assumptions which you do not
> care to formalize and yet you claim that "the number of comparisons is
> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
> explanation for this. At all.

I will accept that my explanation was not good enough, but I strongly 
disagree that I gave no explanation at all.


>> Herr Professor Frederick Schmidt
>> Herr Professor Frederick Wagner
>> ...
> 
> Is your assumption that we're comparing words that have the common prefix
> "Herr Professor Frederick "? 

No, I am pointing out that *your* assumption -- that most string 
comparisons will halt close to the beginning of the string -- is invalid. 
It only holds for some non-random strings.


[...]
>> I have no idea why you think this program demonstrates anything about
>> string equality comparisons. What you are counting is not the average
>> number of comparisons for string equality, but the number of
>> comparisons when sorting. These are *completely different*, because
>> sorting a list does not require testing each string against every other
>> string.
> 
> You're wrong. It's not counting the number of string comparisons, 

I didn't say it was. 


> it's counting the number of character comparisons.

Correct. But only out of an extremely limited subset of all possible 
string comparisons: the very few that happen when sorting lists of 
English words with one specific algorithm, timsort.


> Which is exactly what
> we're talking about in this thread, are we not? The sorting of the list
> is just to show some example where lots of strings are compared.

It is not good enough to extract a non-random subset of strings (only 
English words) and then decrease the data set even further by only 
comparing an extremely non-uniform subset of those strings: specifically 
those few string comparisons that happen to occur using timsort.

Whatever figure you have calculated by taking this non-random selection, 
it is irrelevant to the question of the average performance of string 
equality given random strings.



>>> So, interestingly, in this real-world case(tm), the complexity does
>>> not scale with O(n). It actually goes down (which is probably due to
>>> the limited number of longer words).
>> 
>> I would guess that it probably has more to do with the ability of the
>> timsort algorithm to exploit local order within a list and avoid
>> performing comparisons.
> 
> Again, it's not counting the number of string comparisons. It's counting
> the average number of character comparisons per string comparison. The
> total amount of string comparisons has nothing to do with it.

I didn't say anything about counting string comparisons. But you are 
wrong -- the numbers you get are *strongly* influenced by the number of 
string comparisons performed. That is just one of the reasons why the 
results you get are biased.

Let me spell it out for you: given a list of five-letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'
'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place.
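You can check this for yourself with a rough sketch like the following 
(the Tracked class is purely illustrative): it records which pairs a sort 
actually compares, and that turns out to be only a fraction of all 
possible pairs.

import itertools

class Tracked(str):
    # Record every pair of strings the sort actually compares.
    compared = set()
    def __lt__(self, other):
        Tracked.compared.add(frozenset((str(self), str(other))))
        return str.__lt__(self, other)

words = ['fruit', 'apple', 'claim', 'pears', 'toast']
sorted(Tracked(w) for w in words)

all_pairs = {frozenset(p) for p in itertools.combinations(words, 2)}
print(len(Tracked.compared), "pairs compared out of", len(all_pairs))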

For the sake of simple calculations, let's pretend that there are only 
1000 five-letter strings possible. We want to know how many character-
comparisons are done on average when testing two random five-letter 
strings for equality. To calculate this average, you must compare every 
string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For 
each equality test, we count the number of character comparisons 
performed, sum all those counts, and divide by 1e6. That is the average 
number of char comparisons for random strings of five letters.
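In code, that exhaustive average looks something like this sketch (using 
a deliberately tiny made-up "universe" instead of 1000 strings, just to 
keep it short and runnable):

from itertools import product

def char_comparisons(a, b):
    # Characters examined by a short-circuiting equality test:
    # stop at the first mismatch, otherwise look at every position.
    if len(a) != len(b):
        return 0  # different lengths: no characters need examining
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            break
    return count

# Toy stand-in for "all possible five-letter strings".
universe = [''.join(p) for p in product('ab', repeat=5)]  # 32 strings

total = sum(char_comparisons(a, b) for a in universe for b in universe)
print("average char comparisons:", total / len(universe) ** 2)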

But that's not what you do. First you eliminate 900 of the 1000 possible 
strings by sampling only English words -- strings like "xxoij" cannot 
possibly be selected. That leaves 100 strings, and hence a highly 
non-random subset of 100*100 = 10000 equality tests.

But you haven't finished. Instead of checking all 10000 equality tests, 
you then check only the 2000 tests that happen to occur when using the 
timsort algorithm to sort a list.

So from the population of one million possible equality tests, you throw 
away the vast majority, then average the *strongly non-random* few that 
are remaining. Naturally the result you get is strongly biased, and it 
has nothing to do with the average Big Oh behaviour of equality on random 
strings.
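
Put the two averages side by side and the bias is easy to see. A rough 
sketch (the word list is a placeholder; substitute a real dictionary file 
to see the effect on English words):

from functools import cmp_to_key

def char_comparisons(a, b):
    # Characters examined by a short-circuiting comparison of a and b.
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            return n
    return n

def average_during_sort(words):
    # Average over only the comparisons the sort happens to perform.
    counts = []
    def cmp(a, b):
        counts.append(char_comparisons(a, b))
        return (a > b) - (a < b)
    sorted(words, key=cmp_to_key(cmp))
    return sum(counts) / len(counts)

def average_all_pairs(words):
    # Average over every ordered pair, each string with itself included.
    counts = [char_comparisons(a, b) for a in words for b in words]
    return sum(counts) / len(counts)

words = ['fruit', 'apple', 'claim', 'pears', 'toast']  # placeholder list
print(average_during_sort(words), average_all_pairs(words))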



-- 
Steven


