Comparing strings from the back?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Sep 6 04:33:14 EDT 2012
On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:
> In news.gmane.comp.python.general, you wrote:
>> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...]
>>>> You are making unjustified assumptions about the distribution of
>>>> letters in the words. This might be a list of long chemical compounds
>>>> where the words typically differ only in their suffix. It might be a
>>>> list of people with titles:
>>>
>>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>>> get my calculation. I'm comparing *random* character strings or
>>> bitstrings.
>>
>> Excuse me, you are not. You are comparing English words which are
>> highly non-random.
>
> Evidently we have different understandings of what 'random' means. I
> don't think it's unreasonable to say that strings drawn uniformly from
> the set of all strings in the English language (having a given number of
> characters) is random. The distribution is not uniform over the set of
> all possible character strings but it is still random. I think Johannes
> deliberately chose these strings to emulate a particular kind of 'real'
> distribution of strings that might occur in practice.
Of course for some "real" applications, your strings being compared will
be English words. And for other applications, they will be strings of
numeric digits uniformly distributed between 20000 and 30000. Or taken
from a Gaussian distribution centered around 7000. Or strings made up of
deeply nested sets of parentheses (((((( ... )))))). Whichever special
distribution of strings you pick, I don't think calling them "random" is
terribly useful in context, even if they are generated randomly from some
non-uniform distribution.
But you know, it really doesn't make a difference. Equality testing will
*still* be O(N), since the asymptotic behaviour for sufficiently large
string comparisons is bounded by O(N). Multiplicative constants
("half the string" versus "0.001 of the string") do not matter.
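To make the point concrete, here's a small sketch (my own illustration,
not anything from CPython's internals) that counts how many character
comparisons a naive left-to-right equality test performs. Whether the
mismatch sits at the end of the string or at the midpoint, the count
grows linearly with N; only the constant in front changes:

```python
def chars_compared(a, b):
    """Count the character comparisons a naive left-to-right
    equality test performs before it can return a result."""
    if len(a) != len(b):
        return 0  # a length check alone settles inequality
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            break
    return count

# Strings that differ only in the last character: the whole
# string must be scanned, so the cost is N comparisons.
for n in (1_000, 2_000, 4_000):
    a = "x" * (n - 1) + "a"
    b = "x" * (n - 1) + "b"
    print(n, chars_compared(a, b))  # grows linearly with n

# Strings that differ at the midpoint: about N/2 comparisons --
# a smaller constant, but still linear in N.
for n in (1_000, 2_000, 4_000):
    a = "x" * n
    b = "x" * (n // 2) + "y" + "x" * (n - n // 2 - 1)
    print(n, chars_compared(a, b))
```

Doubling N doubles the count in both cases, which is exactly why the
multiplicative constant washes out of the big-O analysis.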
I may have been overly conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find the
reference now, so possibly I was thinking of something else (string
searching, perhaps?). In any case, the *worst* case for string equality
testing is certainly O(N) (every character must be looked at), and the
*best* case is obviously O(1) (the first character fails to match). But
I'm not so sure about the average case. Further thought is required.
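For the special case of *uniformly random* strings (which, as discussed
above, is only one possible distribution), the average is easy to check
by simulation. The expected number of comparisons before the first
mismatch is the geometric-like sum 1 + 1/k + 1/k^2 + ... ≈ k/(k-1) for
an alphabet of size k, i.e. about 26/25 ≈ 1.04 for lowercase letters,
which is O(1) regardless of string length. A minimal sketch (the
function names are my own, chosen for illustration):

```python
import random
import string

def chars_compared(a, b):
    """Character comparisons a left-to-right equality test makes
    before stopping at the first mismatch (or the end)."""
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            break
    return count

def average_comparisons(trials=10_000, length=100,
                        alphabet=string.ascii_lowercase):
    """Average comparisons over pairs of uniformly random strings."""
    rng = random.Random(0)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(trials):
        a = ''.join(rng.choice(alphabet) for _ in range(length))
        b = ''.join(rng.choice(alphabet) for _ in range(length))
        total += chars_compared(a, b)
    return total / trials

print(average_comparisons())  # close to 26/25, i.e. about 1.04
```

Of course this says nothing about English words, chemical names, or any
other non-uniform distribution, which is the whole bone of contention.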
--
Steven