Large File Parsing

Sun Jun 15 23:54:21 EDT 2003

On Sun, 15 Jun 2003, Robert S Shaffer <r.shaffer9 at verizon.net> wrote:

> I have upto a 3 million record file to parse, remove duplicates and
> sort by size then numeric value. Is this the best way to do this in
> python. The key is the first column and the ,xx needs removed.
> 
> 1234567,12
> 123456789012,12
<snip>
> def filesort( input1,input2 ):
>     if (len (input1) > len( input2 )):
>         return 1
>     if (len (input1) < len( input2 )):
>         return -1
>     return cmp(input1,input2)

Always avoid writing your own cmp() function.

Here are three examples of such ways, with increasing generality:

def sortedCopy(fin, fout):
    """usable if all values are integers without leading zeroes"""
    d = {}
    for line in fin:
        val, _ = line.split(',', 1)
        d[int(val)] = 1
    k = d.keys()
    k.sort()
    fout.writelines([str(val)+'\n' for val in k])

def sortedCopy(fin, fout):
    """usable if all values are integers (with or without leading zeroes)"""
    # uses a hack: instead of trying to treat 0012 especially, put a leading
    # "1" everywhere. Trick I learned from a Peter Norvig paper.
    d = {}
    for line in fin:
        val, _ = line.split(',', 1)
        d[int('1'+val)] = 1
    k = d.keys()
    k.sort()
    fout.writelines([str(val)[1:]+'\n' for val in k])

def sortedCopy(fin, fout):
    """usable for any strings"""
    d = {}
    for line in fin:
        val, _ = line.split(',', 1)
        d[len(val), val] = 1
    k = d.keys()
    k.sort()
    fout.writelines([val+'\n' for _, val in k])
-- 
Moshe Zadka -- http://moshez.org/
Buffy: I don't like you hanging out with someone that... short.
Riley: Yeah, a lot of young people nowadays are experimenting with shortness.
Agile Programming Language -- http://www.python.org/