Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Sep 4 14:07:29 EDT 2012


On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

> On 04.09.2012 04:17, Steven D'Aprano wrote:
> 
>> On average, string equality needs to check half the characters in the
>> string.
> 
> How do you arrive at that conclusion? 

Take two non-empty strings of the same length, N. If the strings are 
equal, you have to make N comparisons to be sure they are equal (you 
check all N pairs of characters). If they are unequal, you have to check 
each pair of characters up to the first pair that are different:

def string_equality(astr, bstr):
    # Assumes lengths are equal.
    for achr, bchr in zip(astr, bstr):
        if achr != bchr:
            return False
    return True

For the unequal case, how many comparisons do you do? Ahead of time, we 
know very little about the strings. We don't even know how many possible 
characters there are (are they English alphanumeric, ASCII, Greek, Thai, 
or from the full range of 1114111 Unicode code points?) or what their 
probability distribution is.

A reasonable, conservative assumption is to calculate the largest 
possible value of the average for random strings. That largest value 
occurs when the alphabet is as small as possible, namely two characters. 
In practice, strings come from a larger alphabet, up to 1114111 different 
characters for full Unicode strings, so the average for them will be less 
than the average we calculate now.

So for unequal strings, the number of comparisons is equally likely to be 
1, 2, 3, ..., N. The average then is:

sum([1, 2, 3, ..., N])/N

which by a bit of simple algebra works out to be (N+1)/2, or half the 
characters as I said.

(Note that this average assumes the strings are completely random. In 
practice, strings tend to come from strongly non-uniform distributions of 
characters. For instance, in English texts, 'e' is much more likely than 
'q' and most characters are vanishingly rare, and in practice, strings 
are likely to be highly non-random.)



-- 
Steven



More information about the Python-list mailing list