removing duplication from a huge list.

Falcolas garrickp at gmail.com
Fri Feb 27 15:01:29 EST 2009


On Feb 27, 10:07 am, Steve Holden <st... at holdenweb.com> wrote:
> Assuming no duplicates, how does this help? You still have to verify
> collisions.

Absolutely. But a decent hashing function will have very few
collisions, particularly since you know quite a bit about the data
beforehand; with enough advance knowledge of the data you could in
theory pick a hash with no collisions at all.
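
And since the data is known in advance, you can measure this rather
than guess: hash a representative sample of distinct values and count
how many end up sharing a digest. A minimal sketch (MD5 and the
function name are my illustrative choices, not anything from this
thread):

import hashlib

def collision_count(sample):
    """Count distinct values in `sample` whose digest matches a
    different value's digest -- a quick check that a hash is 'decent'
    for this particular data."""
    seen = set()
    collisions = 0
    for value in set(sample):   # distinct values only
        d = hashlib.md5(value).digest()
        if d in seen:
            collisions += 1     # two different values, same digest
        seen.add(d)
    return collisions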

> The problem is you can't guarantee any hash value will be unique, so
> ultimately you have to check whether list entries hashing to the same
> value are or are not equal. Which would mean either having the whole
> list in memory or having access to it via some other random access method.

Again, absolutely. But as Tim pointed out, with a decent hash the
number of superfluous checks should be minimal, so the comparisons you
do make will mostly be against genuine duplicates rather than mere
collisions.
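
To make that concrete, here is a minimal sketch of the idea, assuming
the "huge list" is a file of lines on disk: only digests and file
offsets live in memory, and the file itself is re-read to compare
actual values whenever a digest repeats. The MD5 choice and the
unique_lines name are mine, not from the thread.

import hashlib

def unique_lines(path):
    """Yield each distinct line of `path` once, keeping only digests
    and file offsets (not the lines themselves) in RAM."""
    seen = {}  # digest -> offsets of lines already yielded
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            digest = hashlib.md5(line).digest()
            if digest not in seen:
                seen[digest] = [offset]
                yield line
                continue
            # The digest repeats: seek back and compare actual lines.
            # With a decent hash, almost every repeat is a genuine
            # duplicate, so these extra reads are rarely wasted on
            # mere collisions.
            here = f.tell()
            is_dup = False
            for prev in seen[digest]:
                f.seek(prev)
                if f.readline() == line:
                    is_dup = True
                    break
            f.seek(here)
            if not is_dup:
                seen[digest].append(offset)  # rare collision: keep it
                yield line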

It's worth noting that I picked this algorithm on the assumption that
storing the full values in physical memory (as using a set would
require) was out of the question.
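
For contrast, the in-memory approach being ruled out here is far
simpler, but it keeps every distinct value resident at once (the
function name is again just illustrative):

def unique_lines_in_memory(lines):
    # Set-based dedup: every distinct value stays in RAM, which is
    # exactly the storage cost assumed to be unaffordable above.
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line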

~G


