Sorting in huge files

Larry Bates lbates at syscononline.com
Tue Dec 7 17:24:32 EST 2004


Paul,

I can pretty much promise you that if you really have 10^8
records they should be put into a database, and you should let
the database do the sorting by creating indexes on the fields
that you want to sort on. Something like MySQL should do nicely
and is free.

http://www.mysql.org

Python has a good interface to MySQL if you want to do other
processing on the records.
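
To give a feel for that route, here is a rough sketch using the
MySQLdb driver; the table and column names (entries, key1, payload)
are made up for illustration:

import MySQLdb  # assumes the MySQLdb driver is installed

conn = MySQLdb.connect(host="localhost", user="paul",
                       passwd="secret", db="records")
cur = conn.cursor()
# build the index once, then let the database hand back sorted rows
cur.execute("CREATE INDEX idx_key1 ON entries (key1)")
cur.execute("SELECT key1, payload FROM entries ORDER BY key1")
while True:
    rows = cur.fetchmany(10000)   # stream in chunks, not fetchall()
    if not rows:
        break
    for key, payload in rows:
        pass  # process the records, already grouped by key, here
cur.close()
conn.close()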

The alternative is a good high-speed external sort utility like
Syncsort, etc.
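
If you would rather stay in plain Python, an external merge sort
does the same job: sort memory-sized runs, spill them to temporary
files, then merge. A minimal sketch, assuming newline-delimited
records whose natural string order is the key you want:

import heapq, os, tempfile

def external_sort(in_path, out_path, chunk_bytes=100 * 1024 * 1024):
    """Sort a big newline-delimited file that does not fit in memory."""
    run_paths = []
    with open(in_path) as src:
        while True:
            # readlines() with a size hint reads about chunk_bytes of whole lines
            lines = src.readlines(chunk_bytes)
            if not lines:
                break
            lines.sort()                      # in-memory sort of one run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(lines)
            run_paths.append(path)
    # k-way merge of the sorted runs back into one sorted file
    runs = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)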

Good Luck,
Larry Bates



Paul wrote:
> Hi all
> 
> I have a sorting problem, but my experience with Python is rather
> limited (3 days), so I am running this by the list first.
> 
> I have a large database of 15GB, consisting of 10^8 entries of
> approximately 100 bytes each. I devised a relatively simple key map on
> my database, and I would like to order the database with respect to the
> key.
> 
> I expect a few repeats for most of the keys, and that's actually part
> of what I want to figure out in the end. (Said loosely, I want to group
> all the data entries having "similar" keys. For this I need to sort the
> keys first (data entries having _same_ key), and then figure out which
> keys are "similar").
> 
> A few thoughts on this:
> - Space is not going to be an issue. I have a TB available.
> - The Python sort() on a list should be good enough, if I can load the
> whole database into a list/dict
> - each data entry is relatively small, so I shouldn't use pointers
> - Keys could be strings, integers with the usual order, whatever is
> handy, it doesn't matter to me. The choice will probably have to do
> with what sort() prefers.
> - Also I will be happy with any key space size. So I guess 100*size of
> the database will do.
> 
> Any comments?
> How long should I hope this sort will take? It may sound weird, but I
> actually have 12 different key maps and I want to sort this with
> respect to each map, so I will have to sort 12 times.
> 
> Paul
> 
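
For the grouping step Paul describes, once the file is sorted on a
key, itertools.groupby can walk the runs of equal keys in a single
pass. A small sketch, assuming the key is the first whitespace-
separated field of each line:

import itertools

def key_of(line):
    # assumed record layout: the sort key is the first field on each line
    return line.split(None, 1)[0]

def group_counts(sorted_path):
    """Yield (key, count) for each run of equal keys in an already-sorted file."""
    with open(sorted_path) as f:
        for key, run in itertools.groupby(f, key=key_of):
            yield key, sum(1 for _ in run)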