Comparing strings from the back?

Wed Sep 5 04:29:14 EDT 2012

Roy Smith wrote:

> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
> 
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
> 
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?  If the strings are indeed equal, it's the
> same amount of work starting from either end.  But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
> 
> It doesn't seem un-plausible that this is the case.  For example, most
> of the filenames I work with begin with "/home/roy/".  Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings.  Things like credit card and telephone numbers tend to have
> much more entropy in the trailing digits.

But do you actually compare them with each other? Even if you do I'd guess 
that Python looks at the hash values most of the time.

> On the other hand, hostnames (and thus email addresses) exhibit the
> opposite pattern.

Nor do english words:

$ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
1: 477222 ************************************************************
2:  18870 **
3:   2870 
4:    435 
5:     74 
6:     17 
7:     12 

tail
1: 386633 ************************************************************
2:  74966 ***********
3:  29698 ****
4:   6536 *
5:   1475 
6:    154 
7:     28 
8:     10 

> Anyway, it's just a thought.  Has anybody studied this for real-life
> usage patterns?
> 
> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.