removing duplication from a huge list.

Paul Rubin
Fri Feb 27 17:18:36 EST 2009


Tim Rowe <digitig at gmail.com> writes:
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.

That is not terribly many records by today's standards.  The
knee-jerk approach is to sort them externally (a disk-based merge
sort, since the data won't fit in memory), then make a linear pass
over the sorted output skipping the duplicates, which the sort has
left adjacent.  Is the exercise to write an external sort in Python?
It's worth doing if you've never done it before.
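Something like the sketch below (untested; it assumes
newline-terminated records, that a chunk of about a million lines
fits comfortably in memory, and placeholder file names
records.txt/unique.txt):

import heapq
import itertools
import tempfile

def sorted_runs(infile, max_lines=1000000):
    # Read memory-sized chunks, sort each one, and spill it to a
    # temporary file.  Each temp file is one sorted "run".
    runs = []
    while True:
        chunk = list(itertools.islice(infile, max_lines))
        if not chunk:
            break
        chunk.sort()
        tmp = tempfile.TemporaryFile(mode='w+')
        tmp.writelines(chunk)
        tmp.seek(0)
        runs.append(tmp)
    return runs

def dedupe(infile, outfile):
    # k-way merge of the sorted runs: equal records come out
    # adjacent, so comparing against the previous record is enough
    # to drop duplicates in a single linear pass.
    runs = sorted_runs(infile)
    previous = None
    for record in heapq.merge(*runs):
        if record != previous:
            outfile.write(record)
            previous = record
    for tmp in runs:
        tmp.close()

if __name__ == '__main__':
    with open('records.txt') as src, open('unique.txt', 'w') as dst:
        dedupe(src, dst)

heapq.merge is lazy, so the merge phase only holds one buffered line
per run in memory at a time.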


