Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Sep 7 00:06:42 EDT 2012


On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:

> On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
>> <snip>
>>
>> I may have been overly-conservative earlier when I said that on average
>> string equality has to compare half the characters. I thought I had
>> remembered that from a computer science textbook, but I can't find that
>> reference now, so possibly I was thinking of something else. (String
>> searching perhaps?). In any case, the *worst* case for string equality
>> testing is certainly O(N) (every character must be looked at), and the
>> *best* case is O(1) obviously (the first character fails to match). But
>> I'm not so sure about the average case. Further thought is required.
> 
> For random strings (as defined below), the average compare time is
> effectively unrelated to the size of the string, once the size passes
> some point.

Okay, I've spent some time looking for the reference, and can't find it, 
so I'm now sure I must have been conflating it with something else.

After further thought, and giving consideration to the arguments given by 
people here, I'm now satisfied to say that for equal-length strings, 
string equality is best described as O(N).

1) If the strings are equal, a == b will always compare all N 
   characters in each string.

2) If the strings are unequal, a == b will *at worst* compare
   all N characters.

3) Following usual practice in this field, worst case is the
   one which conventionally is meant when discussing Big Oh
   behaviour. See, for example, "Introduction To Algorithms" 
   by Cormen, Leiserson and Rivest.

Also of some interest is the best case: O(1) for unequal strings (they 
differ at the first character) and O(N) for equal strings.

Also of interest is the case that has caused the majority of the 
discussion, the average case. I am now satisfied that the average number 
of comparisons for unequal strings is O(1). To be precise, it is bounded 
below by 1 comparison (you always have to compare at least one pair of 
characters) and bounded above by 2 comparisons.

(I'm talking about the average here -- the actual number of comparisons 
can range all the way up to N, but the average is <= 2.)

If I've done the maths right, the exact value for the average is:

((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

for random strings of length N taken from an alphabet of size M.

For M = 2, that average approaches but never exceeds 2 as N increases; 
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... 
and so forth.



Thanks to all who contributed.


-- 
Steven



More information about the Python-list mailing list