Comparing strings from the back?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Sep 5 10:30:07 EDT 2012
On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:
> On 05.09.2012 11:24, Johannes Bauer wrote:
>
>> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
>> this assumes O(1) comparisons! If the comparison operation itself were
>> in O(n), the total sorting complexity would be O(n^2 log(n)),
This shows some significant confusion about Big Oh complexity analysis.
When you say that sorting the list of words is O(N log N), the N
refers to the number of words in the dictionary. But when you speak about
comparisons being O(N), the N doesn't refer to the size of the dict, but
the size of the individual strings being compared. There is no reasonable
comparison algorithm where the effort required to compare two strings
depends on the size of the dictionary they have come from.
Since these are *completely different Ns*, you can't combine them to get
O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer
nonsense. If you state it like this:
sorting the list of words is O(N log N) where N = length of the list
comparing the words is O(M) where M = the average length of the words
then the overall complexity is O(N log N)*O(M), but since M is a constant
for any particular dictionary (it's just the average length of the words),
that reduces down to O(N log N).
To say that sorting a dictionary is O(N log N) does *not* imply that
string comparisons are O(1). It only implies that string comparisons
don't depend on the size of the dictionary. Comparing:
s1 = 'a'*30002
s2 = 'a'*30001 + 'b'
takes the same amount of time whether they are in a dictionary of 100
words or one of 100000 words. For the purposes of calculating the
algorithmic complexity of sorting a large list of words, we can treat
comparisons as an elementary operation even though they actually aren't.
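To make that concrete, here is a rough sketch (the variable names and the timing harness are mine, not something measured in the thread) showing that the cost of a comparison is set by where the strings differ, not by any list they might belong to:

```python
import timeit

s1 = 'a' * 30002
s2 = 'a' * 30001 + 'b'  # differs from s1 only at the final character
t1 = 'b' + 'a' * 30001  # differs from s1 at the very first character

# Late difference: ~30000 equal characters must be scanned first.
late = timeit.timeit(lambda: s1 == s2, number=1000)
# Early difference: the comparison can bail out after one character.
early = timeit.timeit(lambda: s1 == t1, number=1000)
print(late > early)
```

Either way, the times would be unchanged if the same two strings sat in a list of 100 words or 100000 words.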
>> which is
>> definitely false. Most comparisons will abort *very* early (after the
>> first character).
You are making unjustified assumptions about the distribution of letters
in the words. This might be a list of long chemical compounds where the
words typically differ only in their suffix. It might be a list of people
with titles:
Herr Professor Frederick Schmidt
Herr Professor Frederick Wagner
...
But either way, it doesn't matter for the Big Oh behaviour of sorting the
words.
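A small illustration (the counting helper is mine, purely hypothetical) of how such shared prefixes force a comparison deep into both strings:

```python
def chars_scanned(a, b):
    """Number of character positions examined before finding a difference."""
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

# The two title examples above differ only at the surname:
print(chars_scanned("Herr Professor Frederick Schmidt",
                    "Herr Professor Frederick Wagner"))   # 26
# whereas words with distinct first letters differ immediately:
print(chars_scanned("cat", "dog"))                        # 1
```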
> I've just written a program to demonstrate this (below it's attached).
I have no idea why you think this program demonstrates anything about
string equality comparisons. What you are counting is not the average
number of comparisons for string equality, but the number of comparisons
when sorting. These are *completely different*, because sorting a list
does not require testing each string against every other string.
Whatever you have measured, it has nothing to do with the Big Oh
complexity of string comparisons.
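One way to see the distinction (a rough sketch of mine, counting calls through `functools.cmp_to_key`) is that sorting N items takes on the order of N log N comparisons, nowhere near the N*(N-1)/2 an all-pairs equality test would need:

```python
import functools
import random

count = 0
def counting_cmp(a, b):
    """Three-way comparison that tallies how often it is called."""
    global count
    count += 1
    return (a > b) - (a < b)

words = ['w%05d' % i for i in range(1000)]
random.shuffle(words)
words.sort(key=functools.cmp_to_key(counting_cmp))

all_pairs = 1000 * 999 // 2   # 499500 pairwise tests
print(count, all_pairs)       # count is vastly smaller than all_pairs
```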
It seems to me that you have misunderstood the purpose of Big Oh
notation. It gives an asymptotic upper bound on the amount of work done,
ignoring all lower terms and multiplicative factors, not an exact formula.
If something is O(N), then this tells you:
1) the number of algorithmic steps is proportional to N, plus or
minus some terms which are less than N (e.g. sqrt(N), or log(N),
or 1/N, or some constant, etc.);
2) if you ignore those other terms, and keep all other influences
equal, then doubling N will *at worst* double the number of
algorithmic steps;
3) if all those algorithmic steps take exactly the same amount of
time, then doubling N will take twice as much time.
But of course *none* of those assumptions hold for measurements of actual
elapsed time. In real life, operations rarely take constant time, you
don't always see worst case behaviour, and the lower terms are not
necessarily insignificant enough to ignore.
[...]
> So, interestingly, in this real-world case(tm), the complexity does not
> scale with O(n). It actually goes down (which is probably due to the
> limit amount of words of higher length).
I would guess that it probably has more to do with the ability of the
timsort algorithm to exploit local order within a list and avoid
performing comparisons.
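A hedged sketch of that effect (the counting helper is mine): timsort recognises the single ascending run in already-ordered data and finishes with roughly N-1 comparisons, far fewer than it needs on shuffled data:

```python
import functools
import random

def comparisons_to_sort(data):
    """Sort a copy of data, returning how many comparisons sorted() made."""
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(data, key=functools.cmp_to_key(cmp))
    return count

ordered = list(range(2000))
shuffled = list(range(2000))
random.shuffle(shuffled)

print(comparisons_to_sort(ordered))   # ~1999: one long run, nothing to merge
print(comparisons_to_sort(shuffled))  # far more, roughly N log N
```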
> In summary, let's please not debate "feelings" or such.
I have no idea what feelings you are referring to.
--
Steven
More information about the Python-list mailing list