random writing access to a file in Python
Tim Chase
python.list at tim.thechases.com
Fri Aug 25 17:54:55 EDT 2006
> Is there a ready to use (free, best Open Source) tool able to sort lines
> (each line appr. 20 bytes long) of a XXX GByte large text file (i.e. in
> place) taking full advantage of available memory to speed up the process
> as much as possible?
Sounds like an occasion to use a merge-sort. The pseudo-code
would be:

  break up the file into bite-sized chunks (maybe a couple megs
  each).
    Sort each of them linewise.
    Write them out to intermediate files

  Once you have these pieces, open each file
  read the first line of each one.

  [here] Find the "earliest" of each of those lines according to
  your sort-order.
    write it to your output file
    read the next line from that particular file
    return to [here]
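
The pseudo-code above could be sketched in Python roughly as
follows. This is only a sketch under some assumptions: the
function name sort_large_file and the lines_per_chunk parameter
are made up for illustration, lines are compared as plain strings,
and the standard library's heapq.merge stands in for the
hand-written "find the earliest line" loop.

```python
import heapq
import itertools
import os
import tempfile


def sort_large_file(src_path, dst_path, lines_per_chunk=100_000):
    """Sort the lines of src_path into dst_path via sorted chunks on disk.

    Hypothetical helper; lines_per_chunk is an assumed tuning knob.
    """
    chunk_paths = []
    try:
        # Phase 1: read a memory-sized chunk, sort it, write it to its
        # own intermediate file.
        with open(src_path) as src:
            while True:
                chunk = list(itertools.islice(src, lines_per_chunk))
                if not chunk:
                    break
                # The last line of the input may lack a newline; add one
                # so it cannot fuse with its neighbour after merging.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Phase 2: open every intermediate file and repeatedly emit the
        # earliest current line. heapq.merge does this lazily, holding
        # only one line per file in memory.
        files = [open(p) for p in chunk_paths]
        try:
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the intermediate files whether or not the sort succeeded.
        for p in chunk_paths:
            os.remove(p)
```

Something like sort_large_file("big.txt", "big.sorted.txt") would
then write a sorted copy. Note that this keeps one file handle
open per chunk, so with a very large input and small chunks you
may hit the OS limit on open files and need to merge in several
passes.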
There are some optimizations that can be had on this as well.
You can find the "earliest" *and* the "next earliest" of those
lines/files, and just keep reading from the "earliest" file until
its current line passes the "next earliest". Lather, rinse,
repeat: the "next earliest" becomes the new "earliest", and you
then find the new "next earliest".
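
A minimal sketch of that optimization, assuming each source is
an already-sorted iterable of lines (an open intermediate file
works). The name merge_with_runs and the write callback are
invented for this example, and re-sorting the small list of
current lines on every pass is just one simple way to find the
"earliest" and "next earliest".

```python
def merge_with_runs(sources, write):
    """Merge sorted iterables, draining the leader while it stays ahead.

    Hypothetical helper: `sources` are sorted iterables of lines and
    `write` is any callable that accepts one line (e.g. outfile.write).
    """
    # Each entry is [current_line, iterator]; empty sources are skipped.
    heads = []
    for src in sources:
        it = iter(src)
        first = next(it, None)
        if first is not None:
            heads.append([first, it])

    while heads:
        # Put the "earliest" at index 0 and the "next earliest" at index 1.
        heads.sort(key=lambda h: h[0])
        line, it = heads[0]
        limit = heads[1][0] if len(heads) > 1 else None

        # Keep reading from the earliest source until its current line
        # passes the next earliest (or the source runs out).
        while line is not None and (limit is None or line <= limit):
            write(line)
            line = next(it, None)

        if line is None:
            heads.pop(0)         # this source is exhausted
        else:
            heads[0][0] = line   # remember where this source stopped
```

This pays off mostly when the input has long stretches that are
already in order, since whole runs then come from one file without
re-comparing against all the others.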
I don't know if the GNU "sort" utility does this, but I've thrown
some rather large files at it and haven't choked it yet.
-tkc