optimization pointers?

Mel Wilson mwilson at the-wire.com
Thu Dec 11 17:17:15 EST 2003


In article <Bz5Cb.7881$_r6.5116 at newsread1.news.pas.earthlink.net>,
"r.e.s." <r.s at XXmindspring.com> wrote:
>Would anyone care to offer pointers on how the following code
>might be optimized?  Or alternatives?  ...
>
>---
>def lz(string):
>    """ Return the Lempel-Ziv complexity (LZ78) of string. """
>
>    vocabulary = []
>    word = ''
>    for char in string:
>        word += char
>        if word not in vocabulary:
>            vocabulary.append(word)
>            word = ''
>    return len(vocabulary)
>---
>
>On my 2 GHz system running PythonWin, the above code averages
>about 1.3 hrs to process a 1 MB string.  I'd hoped to compute
>LZ78 for multi-GB files, but it seems infeasible. (?)
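(For concreteness, an example of mine, not from the original post:
the function parses the string into successive previously-unseen
phrases, so

    >>> lz('ababab')    # phrases found: 'a', 'b', 'ab'
    3

and a trailing phrase that repeats an earlier one goes uncounted.)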

One simple improvement is to get rid of

        if an_item not in a_list

which does a linear search over the list, so every test costs time
proportional to the size of the vocabulary.  A dictionary is faster:
lookup hashes the key and takes roughly constant time, however many
words are stored.
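To see the difference in isolation, here is a minimal timing sketch
(the names and sizes are mine, purely illustrative):

import timeit

setup = ("words = ['w%d' % i for i in range(10000)]\n"
         "as_list = words\n"
         "as_dict = dict.fromkeys(words)")

# 'w9999' sits at the end, so the list scan walks all 10000 entries;
# the dictionary hashes straight to the matching key.
print(timeit.Timer("'w9999' in as_list", setup).timeit(1000))
print(timeit.Timer("'w9999' in as_dict", setup).timeit(1000))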

Without a full benchmark, it seems that appending the following to the
original file (so that both the quoted lz and the lz_complexity below
are defined)


def lz_complexity(s):
    """ Lempel-Ziv complexity, with the vocabulary kept in a dict. """
    voc = {}
    word = ''
    for c in s:
        word += c
        if word not in voc:     # hashed lookup instead of a linear scan
            voc[word] = True    # value is a dummy; only the keys matter
            word = ''
    return len(voc)

if __name__ == '__main__':
    import time
    text = file('lzc.py', 'rt').read()   # test text: this source file

    # Baseline: the cost of the clock calls themselves.
    t0 = time.clock()
    t0 = time.clock() - t0

    c = lz(text)                # warm-up run, not timed
    t1 = time.clock()
    c = lz(text)
    t1 = time.clock() - t1

    c = lz_complexity(text)     # warm-up run, not timed
    t2 = time.clock()
    c = lz_complexity(text)
    t2 = time.clock() - t2

    print 'T1:', t1 - t0
    print 'T2:', t2 - t0


will show about a 6x speedup.

Avoiding

        a_string += another_string

is a traditional speedup; the usual rewrite collects the pieces in a
list and joins them once at the end.  But this algorithm needs the
string form of the word after every character processed, for the
dictionary lookup, so the join would run each time through the loop
and may not buy much in this case.
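For what it's worth, here is a sketch of that rewrite (mine, and
untimed; lz_complexity_join is a made-up name).  Note the join must
run every character, because the dictionary is keyed on the whole
word:

def lz_complexity_join(s):
    voc = {}
    chars = []                  # accumulate characters in a list
    for c in s:
        chars.append(c)
        word = ''.join(chars)   # the lookup still needs the full word
        if word not in voc:
            voc[word] = True
            chars = []
    return len(voc)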

        Regards.    Mel.



