Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Sep 5 10:30:07 EDT 2012


On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

> On 05.09.2012 11:24, Johannes Bauer wrote:
> 
>> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
>> this assumes O(1) comparisons! If the comparison operation itself were
>> in O(n), the total sorting complexity would be O(n^2 log(n)), 

This shows some significant confusion about Big Oh complexity analysis 
here. When you say that sorting the list of words is O(N log N), the N 
refers to the number of words in the dictionary. But when you speak about 
comparisons being O(N), the N doesn't refer to the size of the dict, but 
the size of the individual strings being compared. There is no reasonable 
comparison algorithm where the effort required to compare two strings 
depends on the size of the dictionary they have come from.

Since these are *completely different Ns*, you can't combine them to get 
O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer 
nonsense. If you state it like this:

sorting the list of words is O(N log N) where N = length of the list
comparing the words is O(M) where M = the average length of the words

then the overall complexity is O(M*N log N), but since M is a constant 
for any particular dictionary (it's just the average length of the 
words), that reduces down to O(N log N).
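
A quick sketch of how one might check that empirically (my own 
illustration, not part of the quoted program; the sizes and the helper 
random_words are arbitrary choices): hold the word length M fixed, grow 
N, and the sort time should track N log N rather than anything 
quadratic.

import math
import random
import string
import timeit

def random_words(n, m=8):
    # n random lowercase words, each of the same fixed length m
    return [''.join(random.choice(string.ascii_lowercase) for _ in range(m))
            for _ in range(n)]

for n in (10000, 20000, 40000, 80000):
    words = random_words(n)
    t = timeit.timeit(lambda: sorted(words), number=5)
    print("N=%-6d time=%.3fs  time/(N log N)=%.2e"
          % (n, t, t / (n * math.log(n))))

If sorting really is O(N log N) for fixed M, the last column should stay 
roughly flat as N doubles.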

To say that sorting a dictionary is O(N log N) does *not* imply that 
string comparisons are O(1). It only implies that string comparisons 
don't depend on the size of the dictionary. Comparing:

s1 = 'a'*30002
s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100 
words or one of 100000 words. For the purposes of calculating the 
algorithmic complexity of sorting a large list of words, we can treat 
comparisons as an elementary operation even though they actually aren't.
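
If you want to convince yourself of that, a minimal sketch with timeit 
(assuming CPython; the strings and repeat count are just my own choices) 
is enough. Note that no list or dict of any size appears in the timed 
statement at all:

import timeit

s1 = 'a' * 30002
s2 = 'a' * 30001 + 'b'

# About 30001 equal characters have to be examined before the strings
# differ, regardless of how many other words exist anywhere else.
print(timeit.timeit('s1 < s2', globals={'s1': s1, 's2': s2}, number=100000))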


>> which is
>> definitely false. Most comparisons will abort *very* early (after the
>> first character).

You are making unjustified assumptions about the distribution of letters 
in the words. This might be a list of long chemical compound names 
where the words typically differ only in their suffix. It might be a 
list of people
with titles:

Herr Professor Frederick Schmidt
Herr Professor Frederick Wagner
...

But either way, it doesn't matter for the Big Oh behaviour of sorting the 
words.
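
To see how much the data matters, here is a small sketch (made-up 
strings, purely illustrative): a long shared prefix forces the 
comparison to scan a long way before it can decide, while strings that 
differ at the first character are decided immediately.

import timeit

prefix = 'methyl' * 2000            # long common prefix, invented for effect
a = prefix + 'amine'
b = prefix + 'benzene'
c, d = 'amine', 'benzene'           # differ at the very first character

print(timeit.timeit('a < b', globals={'a': a, 'b': b}, number=100000))
print(timeit.timeit('a < b', globals={'a': c, 'b': d}, number=100000))

The first timing should come out much larger than the second, yet 
neither has anything to do with the N of the sort.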


> I've just written a program to demonstrate this (below it's attached).

I have no idea why you think this program demonstrates anything about 
string equality comparisons. What you are counting is not the average 
number of comparisons for string equality, but the number of comparisons 
when sorting. These are *completely different*, because sorting a list 
does not require testing each string against every other string.

Whatever you have measured, it has nothing to do with the Big Oh 
complexity of string comparisons.
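
If you want to measure something meaningful, count the comparisons the 
sort itself performs. A rough sketch (my own, using a str subclass whose 
__lt__ bumps a counter; the size n is arbitrary):

import math
import random
import string

comparisons = 0

class CountedStr(str):
    def __lt__(self, other):
        global comparisons
        comparisons += 1
        return str.__lt__(self, other)

n = 10000
words = [CountedStr(''.join(random.choice(string.ascii_lowercase)
                            for _ in range(8)))
         for _ in range(n)]
sorted(words)
print(comparisons, "comparisons for N =", n)
print("N log2 N is about", int(n * math.log(n, 2)),
      "; all-pairs would be", n * (n - 1) // 2)

The count should land in the vicinity of N log2 N, nowhere near the 
N*(N-1)/2 you would need to test every string against every other.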

It seems to me that you have misunderstood the purpose of Big Oh 
notation. It gives an asymptotic upper bound on the amount of work done, 
ignoring all lower terms and multiplicative factors, not an exact formula.

If something is O(N), then this tells you:

1) the number of algorithmic steps is at most proportional to N, plus
   or minus some terms which are less than N (e.g. sqrt(N), or log(N),
   or 1/N, or some constant, etc.);

2) if you ignore those other terms, and keep all other influences
   equal, then doubling N will *at worst* double the number of
   algorithmic steps;

3) if all those algorithmic steps take exactly the same amount of
   time, then doubling N will take twice as much time.

But of course *none* of those assumptions hold for measurements of actual 
elapsed time. In real life, operations rarely take constant time, you 
don't always see worst case behaviour, and the lower terms are not 
necessarily insignificant enough to ignore.


[...]
> So, interestingly, in this real-world case(tm), the complexity does not
> scale with O(n). It actually goes down (which is probably due to the
> limit amount of words of higher length).

I would guess that it probably has more to do with the ability of the 
timsort algorithm to exploit local order within a list and avoid 
performing comparisons.
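
Reusing the comparison-counting trick from above makes timsort's 
adaptivity visible (again just a sketch with invented data): an already 
sorted list needs only about N - 1 comparisons, while a shuffled one 
needs roughly N log N.

import random

comparisons = 0

class CountedInt(int):
    def __lt__(self, other):
        global comparisons
        comparisons += 1
        return int.__lt__(self, other)

data = [CountedInt(i) for i in range(10000)]

comparisons = 0
sorted(data)                        # input already in order
print("sorted input:  ", comparisons, "comparisons")

random.shuffle(data)
comparisons = 0
sorted(data)                        # random order
print("shuffled input:", comparisons, "comparisons")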


> In summary, let's please not debate "feelings" or such.

I have no idea what feelings you are referring to.



-- 
Steven


