Comparing strings from the back?

Thu Sep 6 07:04:19 EDT 2012

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d at davea.name> wrote:
> For random strings (as defined below), the average compare time is
> effectively unrelated to the size of the string, once the size 
passes
> some point.

> Define random string as being a selection from a set of characters, 
with
> replacement.  So if we pick some set of characters, say 10 (or 256, 
it
> doesn't really matter), the number of possible strings is 10**N.

> The likelihood of not finding a mismatch within k characters is  
> (1/10)**k   The likelihood of actually reaching the end of the 
random
> string is (1/10)**N.  (which is the reciprocal of the number of 
strings,
> naturally)

> If we wanted an average number of comparisons, we'd have to 
calculate a
> series, where each term is a probability times a value for k.
>    sum((k * 9*10**-k) for k in range(1, 10))

> Those terms very rapidly approach 0, so it's safe to stop after a 
few. 
> Looking at the first 9 items, I see a value of 1.1111111

> This may not be quite right, but the value is certainly well under 
2 for
> a population of 10 characters, chosen randomly.  And notice that N
> doesn't really come into it.

It's exactly right. You can obtain this result analytically from 
Johannes' formula above. Just replace 256 with 10 to get that the 
expected number of comparisons is

(10/9)*(1 - 10**(-N))

The last term shows the dependence on N and is tiny even for N=9.

Oscar