random writing access to a file in Python

Tim Chase python.list at tim.thechases.com
Fri Aug 25 17:54:55 EDT 2006


> Is there a ready to use (free, best Open Source) tool able to sort lines 
> (each line appr. 20 bytes long) of a XXX GByte large text file (i.e. in 
> place) taking full advantage of available memory to speed up the process 
> as much as possible?

Sounds like an occasion to use a merge-sort.  The pseudo-code 
would be:

Break up the file into bite-sized chunks (maybe a couple megs 
each).
	Sort each of them linewise.
	Write them out to intermediate files.

Once you have these pieces, open each file and read the first 
line of each one.

[here] Find the "earliest" of those lines according to 
your sort-order.
	Write it to your output file.
	Read the next line from that particular file (once a file 
	runs out of lines, drop it from consideration).
	Return to [here] until every file is exhausted.
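
The steps above could be sketched in Python roughly as follows. 
This is only a minimal sketch, not a tuned tool: the function 
names (write_sorted_chunks, merge_chunks, external_sort) and the 
2 MB default chunk size are made up for illustration, lines are 
compared as plain strings, and the result goes to a separate 
output file rather than being sorted in place.

```python
import os
import tempfile


def write_sorted_chunks(src_path, chunk_bytes=2 * 1024 * 1024):
    """Split src_path into sorted temporary files; return their paths."""
    chunk_paths = []
    with open(src_path, "r") as src:
        while True:
            # readlines(hint) returns whole lines, stopping once roughly
            # `hint` bytes have been read.
            lines = src.readlines(chunk_bytes)
            if not lines:
                break
            # The very last line of the file may lack a trailing newline.
            if not lines[-1].endswith("\n"):
                lines[-1] += "\n"
            lines.sort()
            fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w") as chunk:
                chunk.writelines(lines)
            chunk_paths.append(path)
    return chunk_paths


def merge_chunks(chunk_paths, dst_path):
    """Merge already-sorted chunk files into dst_path."""
    files = [open(p, "r") for p in chunk_paths]
    try:
        # Read the first line of each file; skip files that are empty.
        heads = []
        for f in files:
            line = f.readline()
            if line:
                heads.append([line, f])
        with open(dst_path, "w") as dst:
            while heads:
                # [here] find the file whose current line is "earliest"
                i = min(range(len(heads)), key=lambda k: heads[k][0])
                dst.write(heads[i][0])
                # Read the next line from that particular file.
                heads[i][0] = heads[i][1].readline()
                if not heads[i][0]:
                    del heads[i]  # that file is exhausted
    finally:
        for f in files:
            f.close()


def external_sort(src_path, dst_path, chunk_bytes=2 * 1024 * 1024):
    """Sort the lines of src_path into dst_path via temporary chunks."""
    chunk_paths = write_sorted_chunks(src_path, chunk_bytes)
    try:
        merge_chunks(chunk_paths, dst_path)
    finally:
        for p in chunk_paths:
            os.remove(p)
```

Something like external_sort("big.txt", "big.sorted.txt") would 
then do the whole job. Two caveats: this opens every chunk file 
at once, so for a really huge input you would want much larger 
chunks (or several merge passes) to stay under the OS limit on 
open files, and the linear scan for the "earliest" line could be 
replaced by the standard library's heapq.merge().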

There are some optimizations that can be had on this as 
well. You can find the "earliest" *and* the "next earliest" of 
those lines/files, and just read from the "earliest" file until 
its current line passes "next earliest". Lather, rinse, repeat: 
"next earliest" becomes the new "earliest", and you then find 
the new "next earliest".
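
A rough sketch of that optimization, assuming the sources are 
any iterables of already-sorted lines (open file objects work). 
The name merge_runs and the `write` callback are invented for 
this example, and re-sorting the small list of current lines on 
each round is just the simplest way to get at the "earliest" and 
"next earliest" entries, not necessarily the fastest.

```python
def merge_runs(sources, write):
    """Merge sorted iterables of lines, draining the current winner in runs."""
    heads = []
    for src in sources:
        it = iter(src)
        line = next(it, None)
        if line is not None:
            heads.append([line, it])
    while heads:
        # After sorting, heads[0] holds the "earliest" line and
        # heads[1] (if there is one) the "next earliest".
        heads.sort(key=lambda e: e[0])
        line, it = heads[0]
        limit = heads[1][0] if len(heads) > 1 else None
        # Keep reading from the earliest source until its current
        # line passes the next-earliest line (or the source runs out).
        while line is not None and (limit is None or line <= limit):
            write(line)
            line = next(it, None)
        if line is None:
            del heads[0]  # this source is exhausted
        else:
            heads[0][0] = line  # remember where this source stopped
```

With a list of open chunk files and an open output file `dst`, 
the call would be merge_runs(files, dst.write). The payoff is 
biggest when the input already contains long sorted stretches, 
since each such stretch is copied without re-comparing against 
every other file.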

I don't know if the GNU "sort" utility does this, but I've thrown 
some rather large files at it and haven't choked it yet.

-tkc





