Sorting Large File (Code/Performance)

Paul Rubin
Thu Jan 24 14:41:46 EST 2008


Ira.Kovac at gmail.com writes:
> I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
> to sort based on the first two characters.
> 
> I'd greatly appreciate if someone can post sample code that can help
> me do this.

Use the Unix sort command:

   sort inputfile -o outputfile 

I think there is a Cygwin port.

> Also, any ideas on approximately how long is the sort process going to
> take (XP, Dual Core 2.0GHz w/2GB RAM).

Eh, Unix sort would probably take a while, somewhere between 15
minutes and an hour.  If you only have to do it once, it's not worth
writing special-purpose code.  If you have to do it a lot, get some
more RAM for that box, suck the file into memory and do a radix sort.
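
For what it's worth, here is a minimal sketch of the in-memory approach,
assuming the file is UTF-8 (the original post only says "Unicode") and
that ordering the two-character prefixes by code point is acceptable.
It is a single bucket pass on the prefix, which is the same idea as one
radix-sort pass, not a tuned implementation.  The names sort_by_prefix
and sort_file are made up for illustration.

```python
from collections import defaultdict


def sort_by_prefix(lines, width=2):
    """Yield lines ordered by their first `width` characters.

    Lines sharing a prefix keep their original relative order (stable).
    """
    buckets = defaultdict(list)
    for line in lines:
        # Distribute each line into the bucket for its prefix.
        buckets[line[:width]].append(line)
    # Emit buckets in code-point order of the prefix (an assumption;
    # locale-aware collation would need a different key).
    for key in sorted(buckets):
        yield from buckets[key]


def sort_file(src, dst, encoding="utf-8"):
    """Read `src` fully into memory and write it to `dst`, sorted by prefix.

    The encoding default is an assumption.  newline="" keeps the
    original line endings untouched.
    """
    with open(src, encoding=encoding, newline="") as f:
        lines = f.readlines()
    with open(dst, "w", encoding=encoding, newline="") as out:
        out.writelines(sort_by_prefix(lines))
```

Something like sort_file("inputfile", "outputfile") would then do the
whole job, provided the decoded lines fit in RAM, which is why the
extra memory matters.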


