Comparing strings from the back?

Oscar Benjamin oscar.j.benjamin at gmail.com
Wed Sep 5 09:20:49 EDT 2012


On 5 September 2012 10:48, Peter Otten <__peter__ at web.de> wrote:

> Chris Angelico wrote:
>
> > On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__peter__ at web.de> wrote:
> >> comparing every pair in a sample of 1000 8-char words
> >> taken from '/usr/share/dict/words'
> >>
> >> head
> >> 1: 477222 ************************************************************
> >> 2:  18870 **
> >> ...
>
> tail
> 1: 386633 ************************************************************
> 2:  74966 ***********
>
>
> > Not understanding this. What are the statistics,
>
> I measured how many chars they have in common for all combinations of 1000
> words taken from /usr/share/dict/words.
>
> 477222 pairs have one char in common, 18870 pairs have two chars in common
> when compared from the *beginning* of the string.
>
> 386633 pairs have one char in common, 74966 pairs have two chars in common
> when compared from the *end* of the string.
>
> > and what (if it's not obvious from the previous answer) do they prove?
>
> They demonstrate that for two words from that particular corpus it is
> likely that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on
> average) characters into account than a == b, i.e. comparing from the
> back should be slower rather than faster.
>
> If that doesn't help, here's the code ;)
>  <snip>
>
> def count_common(a, b):
>     i = 0
>     for i, (x, y) in enumerate(zip(a, b), 1):
>         if x != y:
>             break
>     return i
>

This function will return 1 if the first character differs. It does not
count the number of common characters; rather, it counts the more relevant
quantity: the number of comparisons required to decide whether the strings
are equal.
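To make that behaviour concrete, here is the same function as a standalone,
runnable sketch (the example words below are my own, not drawn from the
thread's corpus):

```python
def count_common(a, b):
    # Number of character comparisons needed before a == b can be decided
    # (or before the shorter string runs out).
    i = 0
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            break
    return i

print(count_common("apple", "zebra"))  # 1: a first-char mismatch still costs one comparison
print(count_common("strap", "straw"))  # 5: four matches plus the final mismatch
print(count_common("same", "same"))    # 4: equal strings compare every character
```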

It shows that, for the English words in this dictionary, the first letters
of two randomly selected 8-character words are equal ~5% of the time while
the last letters are equal ~20% of the time.

But despite the non-uniformity in the distribution of these strings,
this provides a good example of the fact that for many situations involving
real data, average-case comparison complexity is O(1). This is because the
probability of stopping after N comparisons decreases exponentially with N,
so that the sequence of counts forms something loosely like a geometric
progression:

>>> cmp_forward = [477222, 18870, 2870, 435, 74, 17, 12]
>>> cmp_backward = [386633, 74966, 29698, 6536, 1475, 154, 28, 10]
>>> def ratios(seq):
...     for count1, count2 in zip(seq[:-1], seq[1:]):
...         yield count2 / float(count1)
...
>>> list(ratios(cmp_forward))
[0.03954134553729711, 0.15209326974032855, 0.15156794425087108,
0.17011494252873563, 0.22972972972972974, 0.7058823529411765]
>>> list(ratios(cmp_backward))
[0.19389446839767946, 0.39615292265827173, 0.22008216041484274,
0.22567319461444307, 0.10440677966101695, 0.18181818181818182,
0.35714285714285715]
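To see why an exponentially decaying continuation probability yields an
O(1) expected cost, here is a toy model (my own simplification: it assumes
every position matches independently with a fixed probability p, which the
ratios above show is only loosely true):

```python
def expected_comparisons(p, max_len=1000):
    # Comparison k is performed iff the first k - 1 positions all matched,
    # which under the toy model happens with probability p ** (k - 1).
    return sum(p ** (k - 1) for k in range(1, max_len + 1))

# The series converges to 1 / (1 - p), a constant independent of the
# string length:
print(expected_comparisons(0.05))  # ~1.05
print(expected_comparisons(0.20))  # 1.25
```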

A notable outlier in these sequences is the ratio for the first character
of the two words, which is why, for this string distribution, it is better
to start the comparison at the beginning than at the end.
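Closing the loop, the same tallies roughly reproduce the per-pair averages
quoted earlier in the thread (counts[k - 1] is the number of pairs that
needed k comparisons; the small difference from the quoted 1.34 presumably
reflects a different random sample):

```python
# Tallies from the thread; both sum to 499500 == C(1000, 2) pairs.
cmp_forward = [477222, 18870, 2870, 435, 74, 17, 12]
cmp_backward = [386633, 74966, 29698, 6536, 1475, 154, 28, 10]

def mean_comparisons(counts):
    # Weighted mean: k comparisons were needed by counts[k - 1] of the pairs.
    total = sum(counts)
    return sum(k * c for k, c in enumerate(counts, 1)) / float(total)

print(mean_comparisons(cmp_forward))   # ~1.05
print(mean_comparisons(cmp_backward))  # ~1.32
```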

Oscar

