optimization pointers?

Mel Wilson mwilson at the-wire.com
Thu Dec 11 17:17:15 EST 2003


In article <Bz5Cb.7881$_r6.5116 at newsread1.news.pas.earthlink.net>,
"r.e.s." <r.s at XXmindspring.com> wrote:
>Would anyone care to offer pointers on how the following code
>might be optimized?  Or alternatives?  ...
>
>---
>def lz(string):
>    """ Return the Lempel-Ziv complexity (LZ78) of string. """
>
>    vocabulary = []
>    word = ''
>    for char in string:
>        word += char
>        if word not in vocabulary:
>            vocabulary.append(word)
>            word = ''
>    return len(vocabulary)
>---
>
>On my 2 GHz system running PythonWin, the above code averages
>about 1.3 hrs to process a 1 MB string.  I'd hoped to compute
>LZ78 for multi-GB files, but it seems infeasible. (?)
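(For concreteness, an example of mine, not from the original post:
the function parses the string into successive previously-unseen
phrases, so

    >>> lz('ababab')    # phrases found: 'a', 'b', 'ab'
    3

and a trailing phrase that repeats an earlier one goes uncounted.)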

One simple improvement is to get rid of

        if an_item not in a_list

which does a linear search over the list, so every test costs time
proportional to the size of the vocabulary.  A dictionary is faster:
lookup hashes the key and takes roughly constant time, however many
words are stored.
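To see the difference in isolation, here is a minimal timing sketch
(the names and sizes are mine, purely illustrative):

import timeit

setup = ("words = ['w%d' % i for i in range(10000)]\n"
         "as_list = words\n"
         "as_dict = dict.fromkeys(words)")

# 'w9999' sits at the end, so the list scan walks all 10000 entries;
# the dictionary hashes straight to the matching key.
print(timeit.Timer("'w9999' in as_list", setup).timeit(1000))
print(timeit.Timer("'w9999' in as_dict", setup).timeit(1000))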

Without a full benchmark, it seems that appending the following to the
original file (so that both the quoted lz and the lz_complexity below
are defined)


def lz_complexity(s):
    """ Lempel-Ziv complexity, with the vocabulary kept in a dict. """
    voc = {}
    word = ''
    for c in s:
        word += c
        if word not in voc:     # hashed lookup instead of a linear scan
            voc[word] = True    # value is a dummy; only the keys matter
            word = ''
    return len(voc)

if __name__ == '__main__':
    import time
    text = file('lzc.py', 'rt').read()   # test text: this source file

    # Baseline: the cost of the clock calls themselves.
    t0 = time.clock()
    t0 = time.clock() - t0

    c = lz(text)                # warm-up run, not timed
    t1 = time.clock()
    c = lz(text)
    t1 = time.clock() - t1

    c = lz_complexity(text)     # warm-up run, not timed
    t2 = time.clock()
    c = lz_complexity(text)
    t2 = time.clock() - t2

    print 'T1:', t1 - t0
    print 'T2:', t2 - t0


will show about a 6x speedup.

Avoiding

        a_string += another_string

is a traditional speedup; the usual rewrite collects the pieces in a
list and joins them once at the end.  But this algorithm needs the
string form of the word after every character processed, for the
dictionary lookup, so the join would run each time through the loop
and may not buy much in this case.
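For what it's worth, here is a sketch of that rewrite (mine, and
untimed; lz_complexity_join is a made-up name).  Note the join must
run every character, because the dictionary is keyed on the whole
word:

def lz_complexity_join(s):
    voc = {}
    chars = []                  # accumulate characters in a list
    for c in s:
        chars.append(c)
        word = ''.join(chars)   # the lookup still needs the full word
        if word not in voc:
            voc[word] = True
            chars = []
    return len(voc)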

        Regards.    Mel.



