Large File Parsing

Alex Martelli aleax at aleax.it
Mon Jun 16 03:00:29 EDT 2003


<posted & mailed>

Robert S Shaffer wrote:

> I have up to a 3 million record file to parse, remove duplicates and
> sort by size then numeric value. Is this the best way to do this in
> Python? The key is the first column and the ,xx needs to be removed.
> 
> 1234567,12
> 123456789012,12

No, the approach you outline later cannot possibly be "the best way"
(in terms of speed) because you're passing a comparison function to
the sort method, and that will undoubtedly slow things down a lot.
For speed, use the Decorate-Sort-Undecorate (DSU) idiom that is well
covered in the Sorting and Searching chapter of the Python Cookbook
and more briefly in Python in a Nutshell.  In your case, assuming you
do have enough physical memory to support this:


def run():

    # read each line, keep only the first field, and use a dict to
    # drop duplicates; the (length, value) tuple is the "decorate" step
    lines = open('input.dat')
    dictfile = {}
    for eachline in lines:
        field0 = eachline.split(',', 1)[0]
        key = len(field0), field0
        dictfile[key] = 1
    lines.close()

    # plain tuple sort: length first, then the digit string, which for
    # equal lengths matches numeric order -- no comparison function needed
    allkeys = dictfile.keys()
    del dictfile
    allkeys.sort()

    # "undecorate" while writing: drop the length, emit just the field
    oufile = open('output.dat', 'w')
    for junk, field0 in allkeys:
        oufile.write(field0)
        oufile.write('\n')
    oufile.close()
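
For contrast, here is roughly what the comparison-function approach
looks like (a sketch only -- I'm guessing at the comparison your
code uses, since it isn't quoted above):


def run_with_cmp():

    lines = open('input.dat')
    dupes = {}
    for eachline in lines:
        field0 = eachline.split(',', 1)[0]
        dupes[field0] = 1
    lines.close()

    allkeys = dupes.keys()
    # the lambda is called from Python code for every single comparison
    # the sort performs -- that per-call overhead is what DSU avoids
    allkeys.sort(lambda a, b: cmp((len(a), a), (len(b), b)))

    oufile = open('output.dat', 'w')
    for field0 in allkeys:
        oufile.write(field0)
        oufile.write('\n')
    oufile.close()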


Several variants are possible (e.g. you could use the loop on
allkeys just to prepare a list of strings, and then call
oufile.writelines once to emit them all) but -- always assuming
sufficient physical memory -- I would not expect drastically
different performance among them.  Still, intuition often
plays tricks about performance, so you may want to code the
various possibilities and measure actual speed in your case.
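
For example, that writelines variant would only change the output
step -- something like this (same reading, dedup and sort code as
above, just batching the writes):


def run_writelines():

    lines = open('input.dat')
    dictfile = {}
    for eachline in lines:
        field0 = eachline.split(',', 1)[0]
        dictfile[len(field0), field0] = 1
    lines.close()

    allkeys = dictfile.keys()
    del dictfile
    allkeys.sort()

    # undecorate into a list of output lines, then emit them in one call
    outlines = [field0 + '\n' for junk, field0 in allkeys]
    oufile = open('output.dat', 'w')
    oufile.writelines(outlines)
    oufile.close()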


Alex




