Comparing strings from the back?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Sep 7 00:06:42 EDT 2012
On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:
> On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
>> <snip>
>>
>> I may have been overly-conservative earlier when I said that on average
>> string equality has to compare half the characters. I thought I had
>> remembered that from a computer science textbook, but I can't find that
>> reference now, so possibly I was thinking of something else. (String
>> searching perhaps?). In any case, the *worst* case for string equality
>> testing is certainly O(N) (every character must be looked at), and the
>> *best* case is O(1) obviously (the first character fails to match). But
>> I'm not so sure about the average case. Further thought is required.
>
> For random strings (as defined below), the average compare time is
> effectively unrelated to the size of the string, once the size passes
> some point.
Okay, I've spent some time looking for the reference, and can't find it,
so I'm now sure I must have been conflating it with something else.
After further thought, and giving consideration to the arguments given by
people here, I'm now satisfied to say that for equal-length strings,
string equality is best described as O(N).
1) If the strings are equal, a == b will always compare all N
characters in each string.
2) If the strings are unequal, a == b will *at worst* compare
all N characters.
3) Following usual practice in this field, worst case is the
one which conventionally is meant when discussing Big Oh
behaviour. See, for example, "Introduction To Algorithms"
by Cormen, Leiserson and Rivest.
Also of some interest is the best case: O(1) for unequal strings (they
differ at the first character) and O(N) for equal strings.
Also of interest is the case that has caused the majority of the
discussion, the average case. I am now satisfied that the average number
of comparisons for unequal strings is O(1). To be precise, it is bounded
below by 1 comparison (you always have to compare at least one pair of
characters) and bounded above by 2 comparisons.
(I'm talking about the average here -- the actual number of comparisons
can range all the way up to N, but the average is <= 2.)
If I've done the maths right, the exact value for the average is:
((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
for random strings of length N taken from an alphabet of size M.
For M = 2, that average approaches but never exceeds 2 as N increases;
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333...
and so forth.
Thanks to all who contributed.
--
Steven
More information about the Python-list
mailing list