removing duplication from a huge list.

Steve Holden steve at holdenweb.com
Fri Feb 27 15:31:51 EST 2009


Falcolas wrote:
> On Feb 27, 10:07 am, Steve Holden <st... at holdenweb.com> wrote:
>> Assuming no duplicates, how does this help? You still have to verify
>> collisions.
> 
> Absolutely. But a decent hashing function (particularly since you know
> quite a bit about the data beforehand) will have very few collisions
> (theoretically no collisions, again since we know the data
> beforehand).
> 
>> The problem is you can't guarantee any hash value will be unique, so
>> ultimately you have to check whether list entries hashing to the same
>> value are or are not equal. Which would mean either having the whole
>> list in memory or having access to it via some other random access method.
> 
> Again, absolutely. But as Tim pointed out, with a decent hash the
> number of superfluous checks should be minimal, so your checks will
> mostly occur on valid duplicate values.
> 
> It's worth noting that I picked this algorithm with the assumption
> that any storage to physical memory (such as required by using a set)
> would be out of the question.
> 
I take your point about the relatively rare collisions. Devising a
zero-collision algorithm is far from trivial, since at a minimum it
requires storing all the hashes in a set to verify that each new hash
doesn't collide with any seen previously.

But unless you can *guarantee* zero collisions you still need a
mapping from hash values back to the original data, since that is the
only way to tell whether two records that hash to the same value are
actually duplicates. That's not a trivial detail for fifteen million
values, though it depends somewhat on the format of the OP's
"records". I suppose a list of file positions stored with each hash
might be usable, assuming either fixed-length or properly delimited
records.
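
Something along these lines is roughly what I have in mind -- an
untested sketch only, which assumes newline-delimited records and
uses the built-in hash() purely as a stand-in for whatever hashing
function actually suits the data:

from collections import defaultdict

def iter_unique_records(path):
    # Yield each distinct newline-delimited record from `path` exactly
    # once.  Only hash values and file offsets are kept in memory; when
    # two records hash to the same value the earlier record is re-read
    # from disk to check whether it really is a duplicate.
    seen = defaultdict(list)    # hash value -> offsets of records kept
    with open(path, 'rb') as infile:
        with open(path, 'rb') as reread:
            offset = infile.tell()  # start of the record about to be read
            for line in iter(infile.readline, b''):
                record = line.rstrip(b'\r\n')
                h = hash(record)    # stand-in hash function
                duplicate = False
                for earlier in seen[h]:
                    reread.seek(earlier)
                    if reread.readline().rstrip(b'\r\n') == record:
                        duplicate = True
                        break
                if not duplicate:
                    seen[h].append(offset)
                    yield record
                offset = infile.tell()  # next record starts here

The dict still has to hold an offset for every distinct record, but
that's a great deal smaller than holding fifteen million records
themselves, and the seek-and-compare only happens when two records
actually share a hash value.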

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



