Orders of magnitude
Josiah Carlson
jcarlson at nospam.uci.edu
Mon Mar 29 03:56:35 EST 2004
> Sort, then remove dups.
A list of 10 million integers sucks up ~160 megs of memory in Python. I
doubt the strings would fit even then.
I would suggest a multi-phase method. Break the sequence up into blocks
of 100k elements each and remove duplicates within each block using a
dictionary. Sort each block individually and write it to a file.
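A minimal sketch of this block phase (the function name, block size, and
use of temp files are illustrative, not from the original post):

```python
import os
import tempfile

def write_sorted_blocks(strings, block_size=100_000):
    """Split `strings` into blocks of at most `block_size` unique
    elements, dedup each block in memory, sort it, and write one
    temporary file per block. Returns the list of filenames."""
    filenames = []
    it = iter(strings)
    done = False
    while not done:
        block = set()               # in-memory dedup within this block
        for s in it:
            block.add(s)
            if len(block) >= block_size:
                break
        else:
            done = True             # the input iterator is exhausted
        if not block:
            break
        fd, name = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            for s in sorted(block): # sorted runs make merging possible
                f.write(s + "\n")
        filenames.append(name)
    return filenames
```

Each output file is then a sorted, internally duplicate-free run, which
is exactly what the merge phase below needs.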
Merge the 100 files using a heapq.
#heap is structured like...
[("next string in file a", file_handle_a),
("next string in file b", file_handle_b),...]
If you keep using heappop and heappush from heapq, and remember just the
single last string written, you can remove duplicates quite easily, and
certainly won't run out of memory.
- Josiah