Orders of magnitude

Josiah Carlson jcarlson at nospam.uci.edu
Mon Mar 29 03:56:35 EST 2004


> Sort, then remove dups.

A list of 10 million integers sucks up ~160 megs of memory in Python.  I 
doubt 10 million strings would fit at all.

I would suggest a multi-phase method.  Break the sequence up into 100k 
element blocks, remove duplicates within each block using a dictionary, 
sort each block, and write each block to its own file.
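The first phase might look something like this (a sketch, not the poster's actual code; the helper names, the use of a set rather than a dictionary, and the temp-file handling are my assumptions):

```python
# Phase 1 (sketch): split the input into blocks, dedupe each block
# in memory, sort it, and write it to its own temporary file.
import tempfile

def _flush(block):
    # Write one sorted, deduplicated block to a temp file,
    # rewound so it is ready for the merge phase.
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(s + "\n" for s in sorted(block))
    f.seek(0)
    return f

def write_sorted_runs(strings, block_size=100000):
    """Return open handles to sorted, deduplicated run files."""
    handles = []
    block = set()  # a set dedupes just like dict keys would
    for s in strings:
        block.add(s)
        if len(block) >= block_size:
            handles.append(_flush(block))
            block = set()
    if block:  # don't forget the final partial block
        handles.append(_flush(block))
    return handles
```

Each run file is then sorted and duplicate-free on its own; only cross-file duplicates remain for the merge phase.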

Merge the 100 files by using a heapq.
# heap is structured like...
[("next string in file a", file_handle_a),
 ("next string in file b", file_handle_b), ...]

If you keep using heappop and heappush from heapq, and remember the 
single most recently written string, you can remove duplicates quite 
easily, and certainly won't run out of memory.

  - Josiah
