Sorting Large File (Code/Performance)

Thu Jan 24 15:44:00 EST 2008

On Jan 25, 6:18 am, Ira.Ko... at gmail.com wrote:
> Hello all,
>
> I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
> to sort based on first two characters.

If you mean 1.6 American billion, i.e. 1.6 * 1000 ** 3 lines, and 2 *
1024 ** 3 bytes of data, that's about 1.34 bytes per line, which is not
enough to hold two characters and a line terminator. If you mean other
definitions of "billion" and/or "GB", the result is even fewer bytes
per line.
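
A quick check of that arithmetic, as a sketch. The variable names are made up for illustration, and the "decimal GB" figure is just one of the other definitions alluded to above:

```python
# Back-of-the-envelope check of the figures quoted above.
line_count = 1.6 * 1000 ** 3      # 1.6 American (short-scale) billion lines
binary_gb = 2 * 1024 ** 3         # 2 GB, taking GB to mean 1024 ** 3 bytes
decimal_gb = 2 * 1000 ** 3        # 2 GB, taking GB to mean 1000 ** 3 bytes

bytes_per_line_binary = binary_gb / line_count
bytes_per_line_decimal = decimal_gb / line_count

print(round(bytes_per_line_binary, 2))    # about 1.34
print(round(bytes_per_line_decimal, 2))   # 1.25, i.e. even fewer
```

Either way there is no room for two characters plus a newline on each line, so at least one of the two numbers in the question is probably off.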

What is a "Unicode text file"? How is it encoded: utf8, utf16,
utf16le, utf16be, ??? If you don't know, do this:

print(repr(open('the_file', 'rb').read(100)))

and show us the results.
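
If the file happens to start with a byte order mark, those first bytes already tell you the encoding. Here is a minimal sketch of that check; the function name guess_encoding_from_bom is made up for illustration, and a file with no BOM simply yields None, in which case you still have to look at the bytes yourself:

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def guess_encoding_from_bom(head):
    """Return an encoding name if head (bytes) starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

You would feed it the same bytes as above, i.e. the result of open('the_file', 'rb').read(100). A None result proves nothing: plenty of UTF-8 and UTF-16 files are written without a BOM.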

What does "based on [the] first two characters" mean? Do you mean raw
order based on the ordinal of each character i.e. no fancy language-
specific collating sequence? Do the first two characters always belong
to the ASCII subset?
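
To make the "raw order" reading concrete, here is a small sketch that sorts a few made-up sample strings on the ordinals of their first two characters only. It assumes the lines have already been decoded to Unicode strings; the sample data is invented for illustration:

```python
# Made-up sample lines, already decoded to Unicode strings.
sample_lines = [u'zebra', u'\u00e9clair', u'apple', u'Zulu', u'ab']

# Comparing strings compares code points, so slicing the first two
# characters gives a raw ordinal sort with no language-specific collation.
by_code_point = sorted(sample_lines, key=lambda line: line[:2])

# Order: Zulu, ab, apple, zebra, eclair-with-acute-accent.
# Uppercase 'Z' (90) sorts before lowercase 'a' (97), and the accented
# e (233) sorts after 'z' (122).
for line in by_code_point:
    print(repr(line))
```

If that ordering is not what you want (uppercase before all lowercase, accented letters after 'z'), then you do need a collating sequence, and that is a different and slower problem.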

You'd like to sort a large file? Why? Sorting a file is just a means
to an end, and often another means is more appropriate. What are you
going to do with it after it's sorted?

> I'd greatly appreciate if someone can post sample code that can help
> me do this.

I'm sure you would. However, it would benefit you even more if instead
of sitting on the beach next to the big arrow pointing to the drop
zone, you were to read the manual and work out how to do it yourself.
Here's a start: http://docs.python.org/lib/typesseq-mutable.html

> Also, any ideas on approximately how long is the sort process going to
> take (XP, Dual Core 2.0GHz w/2GB RAM).

If you really have a 2GB file and only 2GB of RAM, I suggest that you
don't hold your breath.

Instead of writing Python code, you are probably better off doing an
external sort. You might consider looking for a Windows port of a
Unicode-capable Unix sort utility. Google "GnuWin32" and see if their
sort does what you want.
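
For the record, this is roughly what an external sort does, as a sketch only: sort chunks that fit in memory, write each one out as a sorted run, then merge the runs. The names external_sort, sort_key and max_lines are made up for illustration, the key assumes raw code-point order on the first two characters, and heapq.merge needs Python 3.5 or later for its key argument. An existing sort utility will almost certainly do this better and faster.

```python
import heapq
import io
import itertools
import os
import tempfile


def sort_key(line):
    # Assumption: raw code-point order of the first two characters.
    return line[:2]


def external_sort(in_path, out_path, encoding, max_lines=100000):
    """Sort in_path into out_path, keeping at most max_lines lines in memory."""
    run_paths = []
    try:
        # Phase 1: read a chunk, sort it in memory, write it out as a run.
        with io.open(in_path, 'r', encoding=encoding) as src:
            while True:
                chunk = list(itertools.islice(src, max_lines))
                if not chunk:
                    break
                if not chunk[-1].endswith(u'\n'):
                    chunk[-1] += u'\n'  # final line of the file had no newline
                chunk.sort(key=sort_key)
                fd, path = tempfile.mkstemp(suffix='.run')
                os.close(fd)
                run_paths.append(path)
                with io.open(path, 'w', encoding='utf-8') as run:
                    run.writelines(chunk)

        # Phase 2: k-way merge of the sorted runs, one line per run in memory.
        runs = [io.open(p, 'r', encoding='utf-8') for p in run_paths]
        try:
            with io.open(out_path, 'w', encoding=encoding) as dst:
                dst.writelines(heapq.merge(*runs, key=sort_key))
        finally:
            for run in runs:
                run.close()
    finally:
        # Clean up the temporary run files whether or not the sort succeeded.
        for path in run_paths:
            os.remove(path)
```

Note that this opens every run at once during the merge, so with a very large input and a small max_lines you would hit the operating system's limit on open files. A real tool merges in several passes.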