Comparing strings from the back?

Thu Sep 6 06:07:38 EDT 2012

On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
> <snip>
>
> I may have been overly-conservative earlier when I said that on
> average string equality has to compare half the characters. I thought
> I had remembered that from a computer science textbook, but I can't
> find that reference now, so possibly I was thinking of something else.
> (String searching perhaps?). In any case, the *worst* case for string
> equality testing is certainly O(N) (every character must be looked
> at), and the *best* case is O(1) obviously (the first character fails
> to match). But I'm not so sure about the average case. Further thought
> is required. 

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement.  So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is  
(1/10)**k   The likelihood of actually reaching the end of the random
string is (1/10)**N.  (which is the reciprocal of the number of strings,
naturally)

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
   sum((k * 9*10**-k) for k in range(1, 10))

Those terms very rapidly approach 0, so it's safe to stop after a few. 
Looking at the first 9 items, I see a value of 1.1111111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly.  And notice that N
doesn't really come into it. 

-- 

DaveA