Some optimization tale

Mon Dec 29 17:12:16 EST 2003

On Sat, 27 Dec 2003 15:23:34 -0500, "Terry Reedy" <tjreedy at udel.edu> wrote:

>
>"Stephan Diehl" <stephan.diehlNOSPAM at gmx.net> wrote in message
>news:bskcb0$36n$00$1 at news.t-online.com...
>> All of this can be found at
>> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252177
>>
>> What I was most surprised at was the inefficency of the trivial solutions
>> (and that the right algorithm makes indeed a difference).
>
>A common surprise.  The science of algorithms (including empirical testing)
>gives real benefits.
>
>> -- 
>os.path.commonprefix --------------------------------------------------
>>
>> def f1(m):
>>     "Given a list of pathnames, returns the longest common leading
>> component"
>>     if not m: return ''
>>     prefix = m[0]
>
>    prefix = m.pop() # avoids comparing prefix to itself as first item
>
>>     for item in m:
>>         for i in range(len(prefix)):
>>             if prefix[:i+1] != item[:i+1]:
>
>I am 99% sure that above is a typo and performance bug that should read
>             if prefix[i:i+1] != item[i:i+1]:
>since all previous chars have been found equal in previous iteration.
>
>>                 prefix = prefix[:i]
>>                 if i == 0:
>>                     return ''
>>                 break
>>     return prefix
>
>Perhaps you could retest this with the two suggested changes.  It will be
>somewhat faster.
>
>> The problem with this algorithm is the copying of all those small
>strings.
>
>Since Python does not have pointers, comparing characters within strings
>requires pulling out the chars as separate strings.  This is why using
>C-coded comparison functions may win even though more comparisons are done.
>
>The reason f1uses slicing (of len 1 after my correction) instead of
>indexing is to avoid exceptions when len(item) < len(prefix).  However, all
>the +1s have a cost (and I suspect slicing does also), so it may pay to
>truncate prefix to the length of item first.  The simplest fix for this
>(untested) gives
>
>def f9(m): # modified f1 == os.path.commonprefix
>    "Given a list of strings, returns the longest common prefix"
>    if not m: return ''
>    prefix = m.pop()
>    for item in m:
>        prefix = prefix[:len(item)]
>        for i in range(len(prefix)):
>            if prefix[i] != item[i]:
>                if not i:            return ''
>                prefix = prefix[:i]
>                break
>    return prefix
>
>> This can be easily fixed
>> --- optimized
>os.path.commonprefix ----------------------------------------
>>
>> def f2(m):
>>     "Given a list of pathnames, returns the longest common leading
>> component"
>>     if not m: return ''
>>     if len(m) == 1: return m[0]
>>     prefix = m[0]
>>     for i in xrange(len(prefix)):
>>         for item in m[1:]:
>>             if prefix[i] != item[i]:
>>                 return prefix[:i]
>>     return prefix[:i]
>
>and easily messed up;-)  If len(item) < len(prefix), item[i] throws
>exception.  For this approach to work, prefix should be set as shortest
>string of m in preliminary loop.  Also, you reslice m several times.  Do it
>once before outer loop.
>
>> It is just not nessesary to compare all strings in the list.
>
>Every string has to be compared to something at least once.
>
>> It is enough to sort the list first
>> and then compare the first and the last element.
>
>Sorting compares all strings in the list to something at least once, and
>most more than once.
>
>> Even though the 'sort' algorithm is coded in C and is therefore quite
>fast,
>> the order of runtime has changed.
>
>The C part is what makes f3 faster.  In your timings, 128 is not large
>enough for the nlogn component to be noticeable.
>
>> Michael Dyck then pointed out that instead of using 'sort', 'min' and
>'max'
>> should be used. While tests suggest that this is true, I have no idea why
>> that should be, since finding a minimum or maximum uses some sorting
>anyway
>
>No.  max and min each do a linear scan.  No sorting.  But each does at
>least as many character comparisons as modified f1 or f2.  The speedup is
>from looping and comparing in C, even though at least twice as many
>compares are done.
>
>> You might have realized that the optimization so far was done one the
>number
>> of strings. There is still another dimension to optimize in and that is
>the
>> actual string comparing.
>> Raymond Hettinger suggests using a binary search:
>
>Since this only affects the final comparison of min and max, and not the n
>comparisons done to calculate each, the effect is minimal and constant,
>independent of number of strings.
>
>Since this compares slices rather than chars in each loop, I wonder whether
>this is really faster than linear scan anyway.  I would like to see timing
>of f5 with min/max of f4 combined with linear scan of f3.  (Basically, f3
>with sort removed and min/max added.)  Since you changed two parts of f3 to
>get f4, we cannot be sure that both changes are each an improvement even
>though the combination of two is.
>
>def f5(seq):
>    if not seq: return ''
>    s1 = min(seq)
>    s2 = max(seq)
>    n = min(len(s1), len(s2))
>    if not n: return '' # not required since s1[0:0] == ''
>    for i in xrange(n) :
>        if s1[i] != s2[i] :
>            return s1[0:i]
>    return s1[0:n]
>
I wonder about this version for speed (not very tested ;-):

 >>> def f10(m):
 ...     "return longest common prefix of strings in seq"
 ...     if not m: return ''
 ...     prefix = m.pop()
 ...     ssw = str.startswith
 ...     for item in m:
 ...         while not ssw(item, prefix): prefix = prefix[:-1]
 ...         if not prefix: return ''
 ...     return prefix
 ...
 >>> f10('abcd abcd'.split())
 'abcd'
 >>> f10('abcd abce'.split())
 'abc'
 >>> f10('abcd abcd a'.split())
 'a'
 >>> f10('abcd abcd a x'.split())
 ''

Regards,
Bengt Richter