Python example: possible speedup?

Hrvoje Niksic hniksic at srce.hr
Wed Sep 8 11:37:38 EDT 1999


As a Python exercise, I wrote a simple program to "scratch an itch",
i.e. do something useful.  However, I found that Python's lack of
speed really bytes me here, so I'd like to hear suggestions for
speedup.  People who don't like that kind of topic, please skip to the 
following article.  Others, read on.

When browsing my Debian packages, I found I often wanted to know the
size of the installed packages.  To achieve this, I wrote a simple
program that prints the installed packages and their sizes, all sorted
by size.  Output looks like this:

tetex-base: 27652
tetex-extra: 27118
cmucl-safe: 22174
mozilla: 13007
timidity-patches: 11662
locales: 11489
quake2: 10800
[...]

To achieve this, the program needs to parse two files:
/var/lib/dpkg/status, to find which packages are installed, and
/var/lib/dpkg/available, to find the size of the installed packages.
Both files are textual, in the form of headers.  Like this:

Package: telnet
Status: install ok installed
Priority: standard
Section: net
Installed-Size: 130
Maintainer: Herbert Xu <herbert at debian.org>
Source: netkit-telnet
Version: 0.14-4
Replaces: netstd
Depends: libc6 (>= 2.1), libncurses4 (>= 4.2-3.1)
Description: The telnet client.
 The telnet command is used for interactive communication with another host
 using the TELNET protocol.

Package: packaging-manual
Status: purge ok not-installed
Priority: optional
Section: doc
[...]

The program was quite easy to write, and easy to read afterwards.  The
problem is that it is also quite slow.  On my system, it takes about
27 CPU seconds (as reported by `time' shell builtin) to do the work,
which can extend to more than a minute of real time, depending on the
system load.

As a comparison, the equivalent Perl program does the same thing in 9
CPU seconds.  I tried everything I knew to make the Python version
fast.  I tried to use `re' to avoid returning headers other than the
ones we're interested in.  I tried changing self.__current to just
current to avoid a dictionary lookup.  I tried to make self.__current
a list, to avoid the expensive `current = current + line' operation.
All of these things made the program measure slower.

I would really appreciate some suggestions.  The code is not large,
and is (I hope) rather elegant.  I am a Python beginner, so I'd also
appreciate tips on Python style and OO technique.  I'll post/mail the
Perl equivalent on demand.

The code follows:

#!/usr/bin/python

import string

class Dpkg_Reader:
    def __init__(self, file):
        self.__current = ''
        self.__fp = open(file)

    def next_header(self):
        while 1:
            line = self.__fp.readline()
            if line == '':
                return (None, None)     # EOF
            if self.__current:
                if line == "\n" or (line[0] != ' ' and line[0] != "\t"):
                    #print "{%s}" % (self.__current)
                    try:
                        name, value = string.split(self.__current, ': ', 1)
                    except:
                        name, value = string.split(self.__current, ':', 1)
                    value = value[:-1]
                    if line == "\n":
                        self.__current = ''
                    else:
                        self.__current = line
                    return (name, value)
            self.__current = self.__current + line

    def close(self):
        self.__fp.close()

def process_status(file):
    reader = Dpkg_Reader(file)
    package = None
    installed = {}
    while 1:
        name, value = reader.next_header()
        if name is None:
            break
        if name == 'Package':
            package = value
        elif name == 'Status':
            status = string.split(value, ' ')
            if status[2] == 'installed':
                installed[package] = 1
    reader.close()
    return installed

def process_available(file, installed):
    reader = Dpkg_Reader(file)
    sizes = {}
    while 1:
        name, value = reader.next_header()
        if name is None:
            break
        elif name == 'Package':
            package = value
        elif name == 'Installed-Size':
            if installed.has_key(package):
                sizes[package] = string.atoi(value)
    reader.close()
    return sizes

def main():
    installed = process_status('/var/lib/dpkg/status')
    sizes = process_available('/var/lib/dpkg/available', installed)
    lst = sizes.keys()
    lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
    for pack in lst:
        print "%s: %d" % (pack, sizes[pack])

if __name__ == '__main__':
    main()




More information about the Python-list mailing list