Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Sep 6 04:33:14 EDT 2012


On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:

> In news.gmane.comp.python.general, you wrote:
>> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...]
>>>> You are making unjustified assumptions about the distribution of
>>>> letters in the words. This might be a list of long chemical compounds
>>>> where the words typically differ only in their suffix. It might be a
>>>> list of people with titles:
>>> 
>>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>>> get my calculation. I'm comparing *random* character strings or
>>> bitstrings.
>>
>> Excuse me, you are not. You are comparing English words which are
>> highly non-random.
> 
> Evidently we have different understandings of what 'random' means. I
> don't think it's unreasonable to say that strings drawn uniformly from
> the set of all strings in the English language (having a given number of
> characters) is random. The distribution is not uniform over the set of
> all possible character strings but it is still random. I think Johannes
> deliberately chose these strings to emulate a particular kind of 'real'
> distribution of strings that might occur in practice.

Of course for some "real" applications, your strings being compared will 
be English words. And for other applications, they will be strings of 
numeric digits uniformly distributed between 20000 and 30000. Or taken 
from a Gaussian distribution centered around 7000. Or strings made up of 
deeply nested sets of parentheses (((((( ... )))))). Whichever special 
distribution of strings you pick, I don't think calling them "random" is 
terribly useful in context, even if they are generated randomly from some 
non-uniform distribution.

But you know, it really doesn't make a difference. Equality testing will 
*still* be O(N) since the asymptotic behaviour for sufficiently large 
string comparisons will be bounded by O(N). Multiplicative constants 
("half the string" versus "0.001 of the string") do not matter.
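To make the O(N) bound concrete, here is a minimal sketch of the kind of 
character-by-character equality test under discussion (this is an 
illustration, not CPython's actual implementation, which compares memory 
blocks):

```python
def str_equal(a, b):
    """Naive string equality: O(1) best case, O(N) worst case."""
    # Unequal lengths can be rejected in constant time.
    if len(a) != len(b):
        return False
    # Exit at the first mismatch; but if the strings are equal,
    # every one of the N characters must be examined.
    for x, y in zip(a, b):
        if x != y:
            return False
    return True
```

However early the loop exits on typical inputs, the worst case (equal 
strings) still inspects all N characters, which is why the comparison is 
O(N) regardless of multiplicative constants.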

I may have been overly conservative earlier when I said that on average 
string equality has to compare half the characters. I thought I had 
remembered that from a computer science textbook, but I can't find that 
reference now, so possibly I was thinking of something else. (String 
searching perhaps?). In any case, the *worst* case for string equality 
testing is certainly O(N) (every character must be looked at), and the 
*best* case is O(1) obviously (the first character fails to match). But 
I'm not so sure about the average case. Further thought is required.
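For what it's worth, the average case can be probed empirically. A rough 
sketch, assuming strings drawn uniformly at random from a k-letter 
alphabet (a deliberately artificial distribution, per the discussion 
above) and counting how many characters a naive equality test inspects 
before the first mismatch:

```python
import random
import string

def compares_until_mismatch(a, b):
    """Count character comparisons a naive equality test performs."""
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

random.seed(0)
N = 1000  # string length
letters = string.ascii_lowercase  # k = 26

def rand_str():
    return ''.join(random.choice(letters) for _ in range(N))

trials = [compares_until_mismatch(rand_str(), rand_str())
          for _ in range(10000)]
avg = sum(trials) / len(trials)
# For uniformly random strings over a k-letter alphabet the expected
# count is k/(k-1) -- about 1.04 for k = 26 -- independent of N.
```

Under this (unrealistic) uniform model the average is a constant, but 
that says nothing about English words or other highly non-uniform 
distributions, where long shared prefixes push the average up.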


-- 
Steven
