Comparing strings from the back?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Sep 10 09:45:01 EDT 2012


On Mon, 10 Sep 2012 08:59:37 +0000, Duncan Booth wrote:

> Gelonida N <gelonida at gmail.com> wrote:
> 
>> On 09/07/2012 06:06 AM, Steven D'Aprano wrote:
>>> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:
>>>
>>>
>>> Also of some interest is the best case: O(1) for unequal strings (they
>>> differ at the first character) and O(N) for equal strings.
>> 
>> The worst case is O(N), or N characters; the average case is O(1), or
>> two characters.
>> 
>> For denial of service attacks, or for systems that are NEVER allowed
>> to fail, the worst case is important.
>> 
>> For most other cases the average complexity counts.
>> 
>> However, I still wonder for how many applications the complexity of
>> string comparisons would actually be the limiting factor.
>> 
>> 
> and of course if you ever do find an application where that worst case
> matters, there's an easy way round it: just call intern() on all the
> strings when they are created.


There are other good reasons for interning strings, but speeding up 
"astring == bstring" is not one of them.


[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
> "t = s[:-1] + 'x'" "s == t"
1000 loops, best of 3: 910 usec per loop
[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
> "t = s[:-1] + 'x'" -s "intern(s); intern(t)" "s == t"
1000 loops, best of 3: 914 usec per loop

No significant difference.

To state the bleedin' obvious, the computational effort required to 
*compare two strings* pre-supposes that those strings already exist, and 
is *completely independent* of the complexity of creating the strings.
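
If you want to see the O(N) worst case directly, here is a rough
sketch (the string sizes are arbitrary, and timings will vary by
machine):

import timeit

# Worst case: equal length, differing only at the final character,
# so the comparison has to scan the entire string before deciding.
for n in (10**4, 10**5, 10**6):
    setup = "s = 'a' * %d; t = 'a' * (%d - 1) + 'b'" % (n, n)
    print(n, timeit.timeit("s == t", setup=setup, number=1000))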


> For the comparison to be the limiting factor you have to be doing a lot
> of comparisons 

Correct.

> on the same string (otherwise creating the string would be the limiting
> factor),

Not necessarily.

Look, it's really hard to come up with a realistic, non-contrived example 
where string comparisons are a significant bottleneck in a non-toy 
application. So let me first off say that *nearly always* you're not 
going to care whether "s == t" looks at all N characters or just the 
first 2 (or 20, or 100). This argument is rather academic (the best sort 
of argument!). Until you start getting up into truly huge strings, we're 
arguing about how many angels can dance on a CPU cache.

But for the record, in principle string comparisons *could* be the 
bottleneck. Example: you have 10000 strings, which are each created once 
and stored in a list. Then you iterate over the list, comparing every 
string against every other string. And due to some weird vagary of the 
data, the strings are nearly all equal.

(Why would you do this? I don't know, maybe it's a programmers' challenge 
found on the Internet, make up your own scenario...)

Total number of strings created: 10000.
Total number of string comparisons performed: 100000000 (10000**2).

The overhead of creating the strings is trivial compared to the overhead 
of comparing them, and since each string is only created once anyway, 
interning them is just a waste of time.
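
Here is a rough sketch of that scenario, scaled down so it finishes
quickly (the totals above assume the full 10000 strings):

import time

# Nearly-equal strings sharing a long common prefix, each created once.
base = 'abc' * 1000
strings = [base + ('x' if i % 100 == 0 else 'y') for i in range(1000)]

start = time.time()
matches = sum(s == t for s in strings for t in strings)
print(matches, 'matches in', time.time() - start, 'seconds')

# Creating 1000 strings is trivial next to the 1000000 comparisons,
# most of which must scan roughly 3000 characters before deciding.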


> so at the expense of a single dictionary
> insertion when the string is created you can get guaranteed O(1) on all
> the comparisons.

What interning buys you is that "s == t" is an O(1) pointer compare if 
they are equal. But if s and t differ in the last character, __eq__ will 
still inspect every character. There is no way to tell Python "all
strings are interned, so if s is not t then s != t as well".
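
To make that concrete, here is a small sketch (sizes are arbitrary;
in Python 3, intern() has moved to sys.intern):

from sys import intern  # not needed on Python 2, where intern() is a builtin

s = intern('abc' * 100000)
t = intern('abc' * 100000)
u = 'abc' * 99999 + 'abx'  # same length, differs only at the very end

assert s is t    # interning collapsed the equal strings into one object
print(s == t)    # True: settled by the O(1) identity check
print(s == u)    # False: still has to scan some 300000 characters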


-- 
Steven


