Comparing strings from the back?

Fri Sep 7 15:10:16 EDT 2012

On 2012-09-07, Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> <snip>
>
> After further thought, and giving consideration to the arguments given by 
> people here, I'm now satisfied to say that for equal-length strings, 
> string equality is best described as O(N).
>
> 1) If the strings are equal, a == b will always compare all N 
>    characters in each string.
>
> 2) If the strings are unequal, a == b will *at worst* compare
>    all N characters.
>
> 3) Following usual practice in this field, worst case is the
>    one which conventionally is meant when discussing Big Oh
>    behaviour. See, for example, "Introduction To Algorithms" 
>    by Cormen, Leiserson and Rivest.

Would you say, then, that dict insertion is O(N)?

>
> Also of some interest is the best case: O(1) for unequal strings (they 
> differ at the first character) and O(N) for equal strings.
>
> Also of interest is the case that has caused the majority of the 
> discussion, the average case. I am now satisfied that the average number 
> of comparisons for unequal strings is O(1). To be precise, it is bounded 
> below by 1 comparison (you always have to compare at least one pair of 
> characters) and bounded above by 2 comparisons.

I find this idea of separating into the comparison of equal strings versus the
comparison of unequal strings rather odd. If the strings you compare come from
a distribution where they are guaranteed to be equal (or unequal) then you can
just use the O(0) comparison method.

Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).

>
> (I'm talking about the average here -- the actual number of comparisons 
> can range all the way up to N, but the average is <= 2.)
>
> If I've done the maths right, the exact value for the average is:
>
> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

I'm not sure where the extra N comes from --------^ but otherwise good.

I would have written that as:

(1 - p) * sum(i * p**(i-1) for i in range(1, N+1))

where p is the probability of a match (1/M for M equally likely characters) or
in closed form:

 ⎛     N                 ⎞
 ⎝1 - p ⋅(1 + N ⋅(1 - p))⎠
 ─────────────────────────
           1 - p

>
> for random strings of length N taken from an alphabet of size M.
>
> For M = 2, that average approaches but never exceeds 2 as N increases; 
> for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... 
> and so forth.

It approaches 1 / (1 - p) or, if you prefer: M / (M - 1)

Oscar