Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Sep 7 20:55:34 EDT 2012


On Fri, 07 Sep 2012 19:10:16 +0000, Oscar Benjamin wrote:

> On 2012-09-07, Steven D'Aprano <steve+comp.lang.python at pearwood.info>
> wrote:
>> <snip>
>>
>> After further thought, and giving consideration to the arguments given
>> by people here, I'm now satisfied to say that for equal-length strings,
>> string equality is best described as O(N).
>>
>> 1) If the strings are equal, a == b will always compare all N
>>    characters in each string.
>>
>> 2) If the strings are unequal, a == b will *at worst* compare
>>    all N characters.
>>
>> 3) Following usual practice in this field, worst case is the
>>    one which conventionally is meant when discussing Big Oh behaviour.
>>    See, for example, "Introduction To Algorithms" by Cormen, Leiserson
>>    and Rivest.
> 
> Would you say, then, that dict insertion is O(N)?

Pedantically, yes. 

But since we're allowed to state (or even imply *wink*) whatever 
assumptions we like, we're allowed to assume "in the absence of 
significant numbers of hash collisions" and come up with amortized O(1) 
for dict insertions and lookups.

(Provided, of course, that your computer has an infinite amount of 
unfragmented memory and the OS never starts paging your dict to disk. 
Another unstated assumption that gets glossed over when we talk about 
complexity analysis -- on real world computers, for big enough N, 
*everything* is O(2**N) or worse.)

Big Oh analysis, despite the formal mathematics used, is not an exact 
science. Essentially, it is a way of bringing some vague order to hand-
wavy estimates of complexity, and the apparent mathematical rigour is 
built on some awfully shaky foundations. But despite that, it actually is 
useful.

Coming back to strings... given that in any real-world application, you 
are likely to have some string comparisons on equal strings and some on 
unequal strings, and more importantly you don't know which are which 
ahead of time, which attitude is less likely to give you a nasty surprise 
when you run your code?

"I have many millions of 100K strings to compare against other 100K 
strings, and string comparisons are O(1) so that will be fast."

"I have many millions of 100K strings to compare against other 100K 
strings, and string comparisons are O(N) so that will be slow, better 
find another algorithm."


Remember too that "for small enough N, everything is O(1)". Getting hung 
up on Big Oh is just as much a mistake as ignoring it completely.


> I find this idea of separating into the comparison of equal strings
> versus the comparison of unequal strings rather odd. If the strings you
> compare come from a distribution where they are guaranteed to be equal
> (or unequal) then you can just use the O(0) comparison method.

If you know that they're (un)equal, you don't need to call == at all.

If you know that "most" strings will be unequal, then you might be 
justified in treating comparisons as O(1) "most of the time" and not 
stress about the occasional slow call. But of course most of the time you 
don't know this, which is why it is appropriate to treat string 
comparisons as O(N) rather than O(1), since that's the upper bound.


> Since string comparison is only useful if the strings can be equal or
> unequal, the average case depends on how often they are equal/unequal as
> well as the average complexity of both. For random strings the frequency
> of equal strings decreases very fast as N increases so that the
> comparison of random strings is O(1).

But that is not an upper bound, and Big Oh analysis is strictly defined 
in terms of upper bounds.



>> (I'm talking about the average here -- the actual number of comparisons
>> can range all the way up to N, but the average is <= 2.)
>>
>> If I've done the maths right, the exact value for the average is:
>>
>> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
> 
> I'm not sure where the extra N comes from --------^ but otherwise good.

The extra N comes from the case where you compare string S with itself, 
which takes exactly N comparisons.



-- 
Steven



More information about the Python-list mailing list