Comparing strings from the back?

Johannes Bauer dfnsonfsduifb at gmx.de
Wed Sep 5 05:17:33 EDT 2012


On 04.09.2012 20:07, Steven D'Aprano wrote:
> A reasonable, conservative assumption is to calculate the largest 
> possible value of the average for random strings. That largest value 
> occurs when the alphabet is as small as possible, namely two characters. 
> In practice, strings come from a larger alphabet, up to 1114111 different 
> characters for full Unicode strings, so the average for them will be less 
> than the average we calculate now.
> 
> So for unequal strings, the number of comparisons is equally likely to be 
> 1, 2, 3, ..., N. The average then is:

That is exactly what I don't agree with. For unequal random strings, the
number of comparisons needed is *not* uniformly distributed. For
demonstration, let's look at bitstrings and the string "1010". There are
15 bitstrings of the same length which are unequal to it, and I'll put
the number of comparisons needed next to each one, comparing left to
right:

0000  1
0001  1
0010  1
0011  1
0100  1
0101  1
0110  1
0111  1
1100  2
1101  2
1110  2
1111  2
1000  3
1001  3
1011  4

i.e. about 1.73 (26/15) comparisons on average with random strings,
compared to the 2.5 comparisons you end up with. My formula does respect
lazy termination (because the probability of needing more comparisons
gets exponentially lower).
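
For what it's worth, here is a quick brute-force check of that count (a
minimal sketch; the helper function and its name are my own):

from itertools import product

def comparisons(a, b):
    # Count character comparisons until the first mismatch.
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return i
    return len(a)  # strings compare equal

target = "1010"
counts = [comparisons("".join(bits), target)
          for bits in product("01", repeat=4)
          if "".join(bits) != target]

print(sum(counts), len(counts), sum(counts) / len(counts))
# -> 26 15 1.7333...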

> sum([1, 2, 3, ..., N])/N
> 
> which by a bit of simple algebra works out to be (N+1)/2, or half the 
> characters as I said.

Yes, but it's a flawed assumption you're making, since it neglects lazy
evaluation and the early abort of the comparison.
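
To make the decay explicit: if each character position matches
independently with probability p = 1/sigma (sigma being the alphabet
size), the first mismatch lands at position k with probability
p**(k-1) * (1-p). A sketch of the resulting expectation, under that
uniform-randomness assumption:

def expected_comparisons(n, sigma):
    # Expected comparisons for two unequal random strings of length n
    # over an alphabet of size sigma; condition on the strings being
    # unequal at all by dividing by 1 - p**n.
    p = 1.0 / sigma
    total = sum(k * p**(k - 1) * (1 - p) for k in range(1, n + 1))
    return total / (1 - p**n)

print(expected_comparisons(4, 2))        # 1.7333... == 26/15, as above
print(expected_comparisons(4, 1114111))  # ~1.000001 for full Unicode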

Best regards,
Johannes

-- 
>> Where EXACTLY did you predict the quake again?
> At least not publicly!
Ah, the latest and to this day most ingenious stunt of our great
cosmologist: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa <hidbv3$om2$1 at speranza.aioe.org>


