Comparing strings from the back?
Alain Ketterlin
alain at dpt-info.u-strasbg.fr
Tue Sep 4 05:58:08 EDT 2012
Roy Smith <roy at panix.com> writes:
> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e., they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
I guess we all have examples with specific entropy distribution. Going
backwards on long strings can be profitable with file paths. If that is
the case, and if you spend a lot of time comparing strings, why not
store them in reverse?
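A minimal sketch of that reverse-storage idea (the names `make_key` and `keys_equal` are mine, not a real API): keep the reversed string as the comparison key, so the usual left-to-right comparison effectively scans the original strings from the back.

```python
def make_key(s: str) -> str:
    """Store the reversed string as the comparison key."""
    return s[::-1]

def keys_equal(k1: str, k2: str) -> bool:
    """Equality is unaffected by reversal; the character-by-character
    comparison now hits the original strings' high-entropy tails first."""
    return k1 == k2

# Paths sharing a long prefix now differ near the *start* of the keys.
a = make_key("/home/roy/project/main.py")
b = make_key("/home/roy/project/util.py")
assert not keys_equal(a, b)
assert keys_equal(make_key("/home/roy/x"), make_key("/home/roy/x"))
```

Of course this only pays off if you compare the same strings many times, so the one-time cost of reversing is amortized.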
> It doesn't seem implausible that this is the case. For example, most
> of the filenames I work with begin with "/home/roy/". Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings.
In that case I would split the paths, and use some sort of string-id, or
use intern() and then compare components with "is" instead of "==".
(Basically, turning the string of characters into a string of
ids/ints/pointers.)
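Here is a rough sketch of that component-interning approach (the helper `intern_path` is just for illustration): split the path once, intern each component, and then corresponding components of equal paths are the very same object, so `is` is a pointer check rather than a character scan.

```python
import sys

def intern_path(path: str) -> tuple:
    """Split a path into interned components."""
    return tuple(sys.intern(part) for part in path.split("/"))

p1 = intern_path("/home/roy/project/main.py")
p2 = intern_path("/home/roy/project/main.py")

# Interned equal strings are the same object, so identity suffices.
assert all(a is b for a, b in zip(p1, p2))
```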
> Things like credit card and telephone numbers tend to have much more
> entropy in the trailing digits. On the other hand, hostnames (and thus
> email addresses) exhibit the opposite pattern.
Yes, real-world is a mess.
> Anyway, it's just a thought. Has anybody studied this for real-life
> usage patterns?
I've tried the "intern()" trick several times with lists of strings, and
was satisfied, but this was in specific cases (with a finite set of
strings). For "short" strings of chars (up to, say, a hundred
characters), I've never found anything significantly faster than the
default sweep (that was in C/C++, not Python). For longer and/or more
structured strings/lists, making use of the structure is a good idea,
but I see no general pattern here (as you note).
> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.
If you're testing for equality, I can't see how this could matter, even
with variable-length encodings.
If you're comparing strings stored in different encodings, you need
different access methods to reach individual characters (but this
doesn't affect the algorithm). And with a variable-length encoding such
as UTF-8, accessing the character at an arbitrary index isn't possible
in constant time anyway; you have to scan from a known boundary.
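To make the equality point concrete: two encoded byte strings are equal iff every byte matches, and that check can run in any order, since it never needs to know where a multi-byte sequence begins. A small sketch (the function name is mine):

```python
def bytes_equal_from_back(a: bytes, b: bytes) -> bool:
    """Byte-wise equality, scanning from the end. Multi-byte UTF-8
    sequences don't matter: equality holds iff all bytes match."""
    if len(a) != len(b):
        return False
    for i in range(len(a) - 1, -1, -1):
        if a[i] != b[i]:
            return False
    return True

# "é" (C3 A9) and "è" (C3 A8) are both two bytes in UTF-8, so these
# encode to equal-length byte strings differing only in the last byte.
s1 = "café".encode("utf-8")
s2 = "cafè".encode("utf-8")
assert len(s1) == len(s2)
assert not bytes_equal_from_back(s1, s2)
assert bytes_equal_from_back(s1, bytes(s1))
```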
-- Alain.