Comparing strings from the back?

Tue Sep 4 14:40:19 EDT 2012

On 4 September 2012 19:07, Steven D'Aprano <
steve+comp.lang.python at pearwood.info> wrote:

> On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:
>
> > On 04.09.2012 04:17, Steven D'Aprano wrote:
> >
> >> On average, string equality needs to check half the characters in the
> >> string.
> >
> > How do you arrive at that conclusion?
>
> Take two non-empty strings of the same length, N. If the strings are
> equal, you have to make N comparisons to be sure they are equal (you
> check all N pairs of characters). If they are unequal, you have to check
> each pair of characters up to the first pair that are different:
>
> def string_equality(astr, bstr):
>     # Assumes lengths are equal.
>     for achr, bchr in zip(astr, bstr):
>         if achr != bchr:
>             return False
>     return True
>
> For the unequal case, how many comparisons do you do? Ahead of time, we
> know very little about the strings. We don't even know how many possible
> characters there are (are they English alphanumeric, ASCII, Greek, Thai,
> or from the full range of 1114111 Unicode code points?) or what their
> probability distribution is.
>
> A reasonable, conservative assumption is to calculate the largest
> possible value of the average for random strings. That largest value
> occurs when the alphabet is as small as possible, namely two characters.
> In practice, strings come from a larger alphabet, up to 1114111 different
> characters for full Unicode strings, so the average for them will be less
> than the average we calculate now.
>
> So for unequal strings, the number of comparisons is equally likely to be
> 1, 2, 3, ..., N. The average then is:

What?

>
> sum([1, 2, 3, ..., N])/N
>
> which by a bit of simple algebra works out to be (N+1)/2, or half the
> characters as I said.
>
> (Note that this average assumes the strings are completely random. In
> practice, strings tend to come from strongly non-uniform distributions of
> characters. For instance, in English texts, 'e' is much more likely than
> 'q' and most characters are vanishingly rare, and in practice, strings
> are likely to be highly non-random.)

If the strings are 'completely random' (by which I assume you mean that
each character is IID) then the probability of a match for the character at
any one index is the same as the probability for a match at any other
index. Lets say the probability for a match is p and that p < 1.

Then for the first comparison:
1) with probability (1 - p) we terminate the loop after 1 comparison.
2) With probability p we continue to the second comparison

The second comparison occurs with probability p (from 2 above) and if we
reach this point then:
1) with probability (1 - p) we terminate the loop after this second
comparison
2) With probability p we continue to the third comparison

The probability of reaching the second comparison is p and the probability
of terminating at this comparison *if we reach it* is (1-p). So the
probability from the outset that we terminate at the second comparison is
p*(1 - p).

Prob(1 comparison) = (1-p) > p*(1-p) = prob(2 comparisons)

(since p < 1)

This can easily be extended by induction or otherwise to show that the
probability of terminating after N comparisons decreases as N increases. In
fact since it decreases by a factor of p each time, even if p is 1/2 (as
would be the case if there were two equally likely characters) the
probability of continuing to N comparisons becomes vanishingly small very
quickly as N increases. In practise p is larger for mast ascii/bmp/unicode
problems so the probability vanishes even more quickly.

If you follow this through you should arrive at Johannes' result
above. Applying this to real data is obviously more complicated since
successive characters are not independent but it would be a fairly unusual
situation if you were not more likely to stop at an earlier point in the
comparison loop than a later one.

Oscar.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20120904/57b02b47/attachment.html>