looking for speed-up ideas

Andrew Dalke adalke at mindspring.com
Mon Feb 3 20:34:02 EST 2003


Ram Bhamidipaty wrote:
> I have some python code that processes a large file. I want to see how
> much faster this code can get. Mind you, I don't _need_ the code to go
> faster - but it sure would be nice if it were faster...

Don't create the FileSize object.  Use a simple tuple instead.  With
an object you have higher overheads to create the object and to make
the comparison.
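
As a rough illustration of the tuple idea (a sketch in present-day Python 3 syntax; the records below are made-up sample values, not real data): tuples compare element by element, so a list of (-size, dirid, name) tuples sorts largest-first without any comparison method written in Python.

```python
# Hypothetical records of the form (size, dirid, name); values are illustrative.
records = [
    (414, "4", ".envrc"),
    (3150900, "3", "big_file.tar.gz"),
    (36505, "4", "make.incl"),
]

# Negate the size so an ordinary ascending sort puts the largest file first.
fileinfo = [(-size, dirid, name) for size, dirid, name in records]
fileinfo.sort()

# Undo the negation when reading the results back out.
largest = [(-neg_size, dirid, name) for neg_size, dirid, name in fileinfo]
print(largest[0])
```

No class, constructor call, or custom comparison is involved; the built-in tuple ordering does all the work.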

Try this.  I don't have a heap available, so I sort the list and cut it
down every once in a while.  It also doesn't do full error checking in
case the input isn't in the right format.  And it uses a more recent
version of Python than the code you have (e.g., no need for xreadlines).

This should be quite fast.

def process(infile):
     dirid_info = {}

     line = infile.readline()
     assert line[:1] == "T"
     ignore, dirname, dirid = line.split()
     dirid_info[dirid] = (None, dirname)

     fileinfo = []

     for line in infile:
         if line[:1] == "F":
             ignore, size, name = line.split("/")
             # negate size so 'largest' is sorted first
             fileinfo.append( (-long(size), dirid, name) )
             if len(fileinfo) > 10000:
                 # Could use a heapq....
                 fileinfo.sort()
                 fileinfo = fileinfo[:200]
         else:
             ignore, dirname, parent_id, dirid = line[:-1].split("/")
             dirid_info[dirid] = (parent_id, dirname)

     fileinfo.sort()
     fileinfo = fileinfo[:200]

     for size, dirid, name in fileinfo:
         size = -size
         components = [name[:-1]]  # need to chop newline
         while dirid is not None:
             dirid, dirname = dirid_info[dirid]
             components.append(dirname)
         components.reverse()
         print size, "/".join(components)

def test():
     import cStringIO
     s = """\
T /remote 0
S/name/0/1
S/joe/1/2
S/bob/1/3
F/3150900/big_file.tar.gz
S/testing/3/4
F/414/.envrc
F/276/BUILD_FLAGS
F/36505/make.incl
F/3861/build_envrc
D/spam/1/5
F/123456789012345678/really_quite_a_bit_of_spam
"""
     f = cStringIO.StringIO(s)
     process(f)

if __name__ == "__main__":
     test()

Here's the output from the test run:

123456789012345678 /remote/name/spam/really_quite_a_bit_of_spam
3150900 /remote/name/bob/big_file.tar.gz
36505 /remote/name/bob/testing/make.incl
3861 /remote/name/bob/testing/build_envrc
414 /remote/name/bob/testing/.envrc
276 /remote/name/bob/testing/BUILD_FLAGS
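
For reference, a minimal sketch of the heap-based variant hinted at in the "Could use a heapq" comment, assuming a Python whose standard library provides heapq.nlargest (written in Python 3 syntax). The helper name largest_files and the sample tuples are hypothetical, chosen only for illustration. With nlargest the sizes do not need to be negated, and the periodic sort-and-cut is no longer needed.

```python
import heapq

def largest_files(fileinfo, n=200):
    """Return the n largest (size, dirid, name) tuples, biggest first.

    fileinfo may be any iterable, including a generator that yields
    records while the input file is being read.
    """
    return heapq.nlargest(n, fileinfo)

# Hypothetical sample records, illustrative only.
sample = [
    (414, "4", ".envrc"),
    (3150900, "3", "big_file.tar.gz"),
    (36505, "4", "make.incl"),
    (276, "4", "BUILD_FLAGS"),
]

top2 = largest_files(sample, 2)
for size, dirid, name in top2:
    print(size, dirid, name)
```

Whether this beats the sort-and-cut loop would have to be measured on the real input.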



					Andrew
					dalke at dalkescientific.com




