removing duplication from a huge list.
Paul Rubin
Fri Feb 27 17:18:36 EST 2009
Tim Rowe <digitig at gmail.com> writes:
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.
That is not terribly many records by today's standards. The knee-jerk
approach is to sort them externally, then make a linear pass skipping
the duplicates. Is the exercise to write an external sort in Python?
It's worth doing if you've never done it before.
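For anyone attempting the exercise, here is a minimal sketch of that approach in Python: sort fixed-size chunks in memory, spill each sorted run to a temporary file, then k-way merge the runs with heapq.merge and drop now-adjacent duplicates with itertools.groupby. The function name dedupe_external and the chunk_size parameter are illustrative, and it assumes newline-delimited records.

```python
import heapq
import itertools
import os
import tempfile

def dedupe_external(input_path, output_path, chunk_size=1_000_000):
    """External sort a file of newline-delimited records, then
    write only the unique records to output_path."""
    temp_files = []
    try:
        # Phase 1: sort chunk_size-line chunks in memory,
        # spilling each sorted run to its own temporary file.
        with open(input_path) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_size))
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.NamedTemporaryFile(
                    mode="w+", delete=False, suffix=".run")
                tmp.writelines(chunk)
                tmp.seek(0)
                temp_files.append(tmp)
        # Phase 2: k-way merge the sorted runs. Equal records are
        # now adjacent, so one linear pass with groupby skips the
        # duplicates without holding the data in memory.
        with open(output_path, "w") as out:
            merged = heapq.merge(*temp_files)
            for record, _group in itertools.groupby(merged):
                out.write(record)
    finally:
        for tmp in temp_files:
            tmp.close()
            os.unlink(tmp.name)
```

With 15 million records this only ever holds one chunk in memory at a time, plus one buffered line per run during the merge.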