Sorting in huge files

Paul paulolivier at gmail.com
Tue Dec 7 15:27:33 EST 2004


Hi all

I have a sorting problem, but my experience with Python is rather
limited (3 days), so I am running this by the list first.

I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.

I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").

A few thoughts on this:
- Space is not going to be an issue. I have a Tb available.
- The Python sort() on list should be good enough, if I can load the
whole database into a list/dict
- each data entry is relatively small, so I shouldn't use pointers
- Keys could be strings, integers with the usual order, whatever is
handy, it doesn't matter to me. The choice will probably have to do
with what sort() prefers.
- Also I will be happy with any key space size. So I guess 100*size of
the database will do.

Any comments?
How long should I hope this sort will take? It will sound weird, but I
actually have 12 different key maps and I want to sort this with
respect to each map, so I will have to sort 12 times.

Paul




More information about the Python-list mailing list