removing duplication from a huge list.

Paul Rubin
Fri Feb 27 17:18:36 EST 2009


Tim Rowe <digitig at gmail.com> writes:
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.

That is not terribly many records by today's standards.  The
knee-jerk approach is to sort them externally (a disk-based merge
sort, since the data won't fit in memory), then make a linear pass
over the sorted output skipping the duplicates, which the sort has
left adjacent.  Is the exercise to write an external sort in Python?
It's worth doing if you've never done it before.
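Something like the sketch below (untested; it assumes
newline-terminated records, that a chunk of about a million lines
fits comfortably in memory, and placeholder file names
records.txt/unique.txt):

import heapq
import itertools
import tempfile

def sorted_runs(infile, max_lines=1000000):
    # Read memory-sized chunks, sort each one, and spill it to a
    # temporary file.  Each temp file is one sorted "run".
    runs = []
    while True:
        chunk = list(itertools.islice(infile, max_lines))
        if not chunk:
            break
        chunk.sort()
        tmp = tempfile.TemporaryFile(mode='w+')
        tmp.writelines(chunk)
        tmp.seek(0)
        runs.append(tmp)
    return runs

def dedupe(infile, outfile):
    # k-way merge of the sorted runs: equal records come out
    # adjacent, so comparing against the previous record is enough
    # to drop duplicates in a single linear pass.
    runs = sorted_runs(infile)
    previous = None
    for record in heapq.merge(*runs):
        if record != previous:
            outfile.write(record)
            previous = record
    for tmp in runs:
        tmp.close()

if __name__ == '__main__':
    with open('records.txt') as src, open('unique.txt', 'w') as dst:
        dedupe(src, dst)

heapq.merge is lazy, so the merge phase only holds one buffered line
per run in memory at a time.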


