Comparing strings from the back?

Alain Ketterlin alain at dpt-info.u-strasbg.fr
Tue Sep 4 05:58:08 EDT 2012


Roy Smith <roy at panix.com> writes:

> There's been a bunch of threads lately about string implementations, and 
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality.  You've already 
> done the obvious quick tests (i.e they're the same length), and you're 
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings 
> instead of at the beginning?  If the strings are indeed equal, it's the 
> same amount of work starting from either end.  But, if it turns out that 
> for real-life situations, the ends of strings have more entropy than the 
> beginnings, the odds are you'll discover that they're unequal quicker by 
> starting at the end.

I guess we all have examples with specific entropy distributions. Going
backwards on long strings can be profitable with file paths. If that is
the case, and if you spend a lot of time comparing strings, why not
store them in reverse?
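A minimal sketch of the store-in-reverse idea (the paths below are
invented for illustration): equality is preserved, and Python's normal
left-to-right comparison now hits the high-entropy end first.

```python
# Sketch: store strings reversed so an ordinary left-to-right
# comparison reaches the (assumed) high-entropy tail first.
# The example paths are made up.
p1 = "/home/roy/projects/a.txt"
p2 = "/home/roy/projects/b.txt"

r1, r2 = p1[::-1], p2[::-1]

# Reversal never changes the equality verdict:
assert (r1 == r2) == (p1 == p2)

# The reversed strings differ at index 4 instead of index 19,
# so a character-by-character sweep stops much sooner.
first_diff = next(i for i, (x, y) in enumerate(zip(r1, r2)) if x != y)
assert first_diff == 4
```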

> It doesn't seem un-plausible that this is the case.  For example, most 
> of the filenames I work with begin with "/home/roy/".  Most of the 
> strings I use as memcache keys have one of a small number of prefixes.  
> Most of the strings I use as logger names have common leading 
> substrings.

In that case I would split the paths and use some sort of string id, or
use intern() and then compare components with "is" instead of "==".
(Basically, turning the string of chars into a string of
ids/ints/pointers.)
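A sketch of that component-interning idea (in Python 3 the builtin
moved to sys.intern; the paths are invented for illustration):

```python
import sys

# Split a path into components and intern each one, so that equal
# components become the very same object and can be compared with "is"
# (an identity/pointer check) instead of a character-by-character "==".
# Example paths are made up.
def intern_components(path):
    return tuple(sys.intern(part) for part in path.split("/"))

a = intern_components("/home/roy/projects/app.py")
b = intern_components("/home/roy/projects/app.py")

# Corresponding components are identical objects, not merely equal:
assert all(x is y for x, y in zip(a, b))
```

This pays off mainly when the same components recur many times, since
interning itself costs a hash lookup per component.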

> Things like credit card and telephone numbers tend to have much more
> entropy in the trailing digits. On the other hand, hostnames (and thus
> email addresses) exhibit the opposite pattern.

Yes, real-world is a mess.

> Anyway, it's just a thought.  Has anybody studied this for real-life 
> usage patterns?

I've tried the "intern()" trick several times with lists of strings, and
was satisfied, but this was in specific cases (with a finite set of
strings). For "short" strings of chars (up to, say, a hundred
characters), I've never found anything significantly faster than the
default sweep (that was in C/C++, not Python). For longer and/or more
structured strings/lists, making use of the structure is a good idea,
but I see no general pattern here (as you note).

> I'm also not sure how this work with all the possible UCS/UTF encodings.  
> With some of them, you may get the encoding semantics wrong if you don't 
> start from the front.

If you're testing for equality, I can't see how this could matter, even
with variable-length encodings.

If you're comparing strings in different encodings, then you need
different access methods to reach individual characters (but this
doesn't affect the algorithm). If you're using a variable-length
encoding, e.g., UTF-8, accessing a character at a given index is not
possible in constant time anyway.
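To illustrate why direction can't matter for equality: UTF-8 encodes
equal strings to equal byte sequences and unequal strings to unequal
ones, so comparing the bytes in either direction gives the same verdict.
A small sketch (the strings are invented):

```python
# Equality of UTF-8 byte sequences matches equality of the strings,
# whichever end you start from. Example strings are made up.
s1 = "caf\u00e9/notes.txt"
s2 = "caf\u00e9/notes.txt"
s3 = "caf\u00e9/other.txt"

b1, b2, b3 = (s.encode("utf-8") for s in (s1, s2, s3))

assert (b1 == b2) == (s1 == s2)              # equal strings, equal bytes
assert (b1 == b3) == (s1 == s3)              # unequal strings, unequal bytes
assert (b1[::-1] == b3[::-1]) == (s1 == s3)  # back-to-front: same verdict
```

(The usual caveat is Unicode normalization: "\u00e9" and "e\u0301"
are distinct strings and distinct byte sequences, so this is byte
equality, not canonical equivalence.)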

-- Alain.


