Comparing strings from the back?

Tue Sep 4 03:56:11 EDT 2012

On 04/09/2012 02:54, Roy Smith wrote:
> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.  But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
>
> It doesn't seem un-plausible that this is the case.  For example, most
> of the filenames I work with begin with "/home/roy/".  Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings.  Things like credit card and telephone numbers tend to have
> much more entropy in the trailing digits.
>
> On the other hand, hostnames (and thus email addresses) exhibit the
> opposite pattern.
>
> Anyway, it's just a thought.  Has anybody studied this for real-life
> usage patterns?
>
> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.
>

Good grief, what timing!!!  The dust has yet to settle over the 
"Flexible string representation, unicode, typography, ..." thread and 
you ask this.  I'll just cross my fingers and hope that people stick 
with facts, realistic benchmarks etc, and not good old fashioned FUD.

-- 
Cheers.

Mark Lawrence.