Large File Parsing

Paul Rubin http
Mon Jun 16 00:13:51 EDT 2003


Robert S Shaffer <r.shaffer9 at verizon.net> writes:
> I have up to a 3 million record file to parse, remove duplicates, and
> sort by size then numeric value. Is this the best way to do this in
> Python? The key is the first column and the ,xx needs to be removed.

Your script is reasonable if you have enough memory to run it over
your input files.  If not, the simplest approach is probably to
filter lines like

  1234567,12
  123456789012,12

into

  10,1234567,12
  15,123456789012,12

where the leading number you prepend is the length of the line.  Then
sort with the Unix sort utility (which does an external sort if the
input is big enough to need it), then filter again to remove the
prepended length.
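
In code, the whole pipeline might look roughly like the sketch below
(untested; decorate.py, undecorate.py and the file names are just
placeholders).  Note that the prepended length has to be compared
numerically in the sort step, since as plain text "10" would sort
before "9", and sort's -u flag also drops exact duplicate records
along the way.

  # decorate.py -- prepend each record's length (rough sketch)
  import sys
  for line in sys.stdin:
      stripped = line.rstrip("\n")
      sys.stdout.write("%d,%s\n" % (len(stripped), stripped))

  # undecorate.py -- strip the prepended length back off (rough sketch)
  import sys
  for line in sys.stdin:
      sys.stdout.write(line.split(",", 1)[1])

Run them around the external sort, e.g.:

  python decorate.py < records.txt \
    | sort -t, -k1,1n -k2 -u \
    | python undecorate.py > sorted.txt

Here -t, makes comma the field separator, -k1,1n sorts the prepended
length field numerically, and -k2 breaks ties on the rest of the line.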
