Large File Parsing
Moshe Zadka
m at moshez.org
Sun Jun 15 23:54:21 EDT 2003
On Sun, 15 Jun 2003, Robert S Shaffer <r.shaffer9 at verizon.net> wrote:
> I have upto a 3 million record file to parse, remove duplicates and
> sort by size then numeric value. Is this the best way to do this in
> python. The key is the first column and the ,xx needs removed.
>
> 1234567,12
> 123456789012,12
<snip>
> def filesort( input1,input2 ):
>     if (len (input1) > len( input2 )):
>         return 1
>     if (len (input1) < len( input2 )):
>         return -1
>     return cmp(input1,input2)
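Avoid writing your own cmp() function whenever you can: sort() has to call
back into Python for every single comparison, and that adds up over millions
of records. Here are three ways to do without one, in order of increasing
generality: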
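
def sortedCopy(fin, fout):
    """usable if all values are integers without leading zeroes"""
    d = {}
    for line in fin:
        val, _ = line.split(',', 1)
        # dict keys are unique, so duplicates collapse here
        d[int(val)] = 1
    k = d.keys()
    # plain numeric sort; without leading zeroes this is also size-then-value
    k.sort()
    fout.writelines([str(val)+'\n' for val in k])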
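
A quick check of that docstring's claim, as a sketch in Python 3 (an assumption; the code in this post targets the Python of 2003): when no value has a leading zero, plain numeric order should be the same as "size, then value" order. The sample values are made up.

```python
# Made-up sample keys, none with a leading zero.
values = ['1234567', '99', '123456789012', '100']

# Order by numeric value.
by_number = sorted(values, key=int)

# Order by length first, then by the text itself.
by_len_then_text = sorted(values, key=lambda v: (len(v), v))

print(by_number == by_len_then_text)  # True
```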

def sortedCopy(fin, fout):
    """usable if all values are integers (with or without leading zeroes)"""
    # uses a hack: instead of special-casing values like 0012, put a leading
    # "1" everywhere. Trick I learned from a Peter Norvig paper.
    d = {}
    for line in fin:
        val, _ = line.split(',', 1)
        d[int('1'+val)] = 1
    k = d.keys()
    k.sort()
    fout.writelines([str(val)[1:]+'\n' for val in k])
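
To see what the leading "1" does, here is a small sketch in Python 3 (an assumption; the function above is written for the Python of 2003). The helper names encode and decode and the sample values are made up for illustration.

```python
def encode(val):
    """Prefix a '1' so leading zeroes survive the trip through int()."""
    return int('1' + val)

def decode(num):
    """Drop the sentinel '1' again to recover the original digits."""
    return str(num)[1:]

values = ['0012', '7', '123', '12']

# '0012' becomes 10012, '7' becomes 17, '123' becomes 1123, '12' becomes 112,
# so every extra digit (zero or not) pushes the value later in the order.
ordered = [decode(n) for n in sorted(encode(v) for v in values)]
print(ordered)  # ['7', '12', '123', '0012']
```

Shorter strings come first, and '0012' keeps its zeroes on the way out.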

def sortedCopy(fin, fout):
    """usable for any strings"""
    d = {}
    for line in fin:
        val, _ = line.split(',', 1)
        d[len(val), val] = 1
    k = d.keys()
    k.sort()
    fout.writelines([val+'\n' for _, val in k])
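
For readers on Python 3, where keys() no longer returns a list and cmp() is gone, the same idea might look like the sketch below. It is not the code from this post: the name sorted_copy, the use of a set instead of a dict, and the in-memory sample data are all assumptions made for illustration.

```python
import io

def sorted_copy(fin, fout):
    """Python 3 sketch of the 'any strings' version: dedupe, then sort."""
    seen = set()
    for line in fin:
        # Keep only the key column; partition() tolerates a missing comma.
        val, _, _ = line.rstrip('\n').partition(',')
        seen.add(val)
    # The key function runs once per item, unlike a cmp() function,
    # which would run once per comparison.
    for val in sorted(seen, key=lambda v: (len(v), v)):
        fout.write(val + '\n')

# Made-up input in the format from the question, with one duplicate key.
fin = io.StringIO("123456789012,12\n1234567,12\n1234567,34\n")
fout = io.StringIO()
sorted_copy(fin, fout)
print(fout.getvalue(), end='')
```

With this input the output is 1234567 followed by 123456789012, each on its own line.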
--
Moshe Zadka -- http://moshez.org/
Buffy: I don't like you hanging out with someone that... short.
Riley: Yeah, a lot of young people nowadays are experimenting with shortness.
Agile Programming Language -- http://www.python.org/