Some optimization tale

Sat Dec 27 15:23:34 EST 2003

"Stephan Diehl" <stephan.diehlNOSPAM at gmx.net> wrote in message
news:bskcb0$36n$00$1 at news.t-online.com...
> All of this can be found at
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252177
>
> What I was most surprised at was the inefficency of the trivial solutions
> (and that the right algorithm makes indeed a difference).

A common surprise.  The science of algorithms (including empirical testing)
gives real benefits.

> -- 
os.path.commonprefix --------------------------------------------------
>
> def f1(m):
>     "Given a list of pathnames, returns the longest common leading
> component"
>     if not m: return ''
>     prefix = m[0]

    prefix = m.pop() # avoids comparing prefix to itself as first item

>     for item in m:
>         for i in range(len(prefix)):
>             if prefix[:i+1] != item[:i+1]:

I am 99% sure that above is a typo and performance bug that should read
             if prefix[i:i+1] != item[i:i+1]:
since all previous chars have been found equal in previous iteration.

>                 prefix = prefix[:i]
>                 if i == 0:
>                     return ''
>                 break
>     return prefix

Perhaps you could retest this with the two suggested changes.  It will be
somewhat faster.

> The problem with this algorithm is the copying of all those small
strings.

Since Python does not have pointers, comparing characters within strings
requires pulling out the chars as separate strings.  This is why using
C-coded comparison functions may win even though more comparisons are done.

The reason f1uses slicing (of len 1 after my correction) instead of
indexing is to avoid exceptions when len(item) < len(prefix).  However, all
the +1s have a cost (and I suspect slicing does also), so it may pay to
truncate prefix to the length of item first.  The simplest fix for this
(untested) gives

def f9(m): # modified f1 == os.path.commonprefix
    "Given a list of strings, returns the longest common prefix"
    if not m: return ''
    prefix = m.pop()
    for item in m:
        prefix = prefix[:len(item)]
        for i in range(len(prefix)):
            if prefix[i] != item[i]:
                if not i:            return ''
                prefix = prefix[:i]
                break
    return prefix

> This can be easily fixed
> --- optimized
os.path.commonprefix ----------------------------------------
>
> def f2(m):
>     "Given a list of pathnames, returns the longest common leading
> component"
>     if not m: return ''
>     if len(m) == 1: return m[0]
>     prefix = m[0]
>     for i in xrange(len(prefix)):
>         for item in m[1:]:
>             if prefix[i] != item[i]:
>                 return prefix[:i]
>     return prefix[:i]

and easily messed up;-)  If len(item) < len(prefix), item[i] throws
exception.  For this approach to work, prefix should be set as shortest
string of m in preliminary loop.  Also, you reslice m several times.  Do it
once before outer loop.

> It is just not nessesary to compare all strings in the list.

Every string has to be compared to something at least once.

> It is enough to sort the list first
> and then compare the first and the last element.

Sorting compares all strings in the list to something at least once, and
most more than once.

> Even though the 'sort' algorithm is coded in C and is therefore quite
fast,
> the order of runtime has changed.

The C part is what makes f3 faster.  In your timings, 128 is not large
enough for the nlogn component to be noticeable.

> Michael Dyck then pointed out that instead of using 'sort', 'min' and
'max'
> should be used. While tests suggest that this is true, I have no idea why
> that should be, since finding a minimum or maximum uses some sorting
anyway

No.  max and min each do a linear scan.  No sorting.  But each does at
least as many character comparisons as modified f1 or f2.  The speedup is
from looping and comparing in C, even though at least twice as many
compares are done.

> You might have realized that the optimization so far was done one the
number
> of strings. There is still another dimension to optimize in and that is
the
> actual string comparing.
> Raymond Hettinger suggests using a binary search:

Since this only affects the final comparison of min and max, and not the n
comparisons done to calculate each, the effect is minimal and constant,
independent of number of strings.

Since this compares slices rather than chars in each loop, I wonder whether
this is really faster than linear scan anyway.  I would like to see timing
of f5 with min/max of f4 combined with linear scan of f3.  (Basically, f3
with sort removed and min/max added.)  Since you changed two parts of f3 to
get f4, we cannot be sure that both changes are each an improvement even
though the combination of two is.

def f5(seq):
    if not seq: return ''
    s1 = min(seq)
    s2 = max(seq)
    n = min(len(s1), len(s2))
    if not n: return '' # not required since s1[0:0] == ''
    for i in xrange(n) :
        if s1[i] != s2[i] :
            return s1[0:i]
    return s1[0:n]

Terry J. Reedy