removing duplication from a huge list.

Steve Holden steve at holdenweb.com
Fri Feb 27 12:07:41 EST 2009


Falcolas wrote:
> On Feb 27, 8:33 am, Tim Rowe <digi... at gmail.com> wrote:
>> 2009/2/27 odeits <ode... at gmail.com>:
>>
>>> How big a list are we talking about? If the list is so big that the
>>> entire list cannot fit in memory at the same time, this approach won't
>>> work, e.g. removing duplicate lines from a very large file.
>> We were told in the original question: more than 15 million records,
>> and it won't all fit into memory. So your observation is pertinent.
>>
>> --
>> Tim Rowe
> 
> If order did matter, and the list itself couldn't be stored in memory,
> I would personally compute some sort of hash of each item (or something
> as simple as its first 5 bytes, last 5 bytes and length), keep a
> reference to the item each hash belongs to, sort and identify
> duplicates among the hashes, and use the reference to check whether the
> actual items in question match as well.
> 
Assuming the hashes aren't guaranteed unique, how does this help? You
still have to verify collisions.
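
For instance, two records can share that kind of simple fingerprint
without actually being equal (the byte strings below are made up purely
for illustration):

    # Two distinct records with the same naive fingerprint
    # (first 5 bytes, last 5 bytes, length) -- matching fingerprints
    # still have to be checked against the real data.
    def fingerprint(s):
        return (s[:5], s[-5:], len(s))

    a = "abcde-SPAM-vwxyz"
    b = "abcde-HAM!-vwxyz"
    assert fingerprint(a) == fingerprint(b)   # fingerprints collide
    assert a != b                             # but the records differ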

> Pretty brutish and slow, but it's the first algorithm which comes to
> mind. Of course, I'm assuming that the list items are long enough to
> warrant using a hash and not the values themselves.
> 
The problem is you can't guarantee any hash value will be unique, so
ultimately you have to check whether list entries hashing to the same
value are or are not equal, which would mean either having the whole
list in memory or having access to it via some other random-access method.
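
If the data lives in a file, a rough sketch of the whole scheme might
look like the following -- the md5 fingerprints, byte offsets and
single-input-file layout are just illustrative assumptions, not a
recommendation:

    import hashlib

    def duplicate_offsets(path):
        """Sketch: report byte offsets of duplicate lines in a large
        file without holding the lines themselves in memory.  Only the
        (digest, offset) pairs are kept; lines whose digests collide
        are verified by seeking back into the file."""
        pairs = []                              # (digest, offset) of every line
        with open(path, 'rb') as f:
            offset = 0
            for line in f:
                digest = hashlib.md5(line).digest()
                pairs.append((digest, offset))
                offset += len(line)

        pairs.sort()                            # equal digests become adjacent

        duplicates = []
        with open(path, 'rb') as f:
            i = 0
            while i < len(pairs):
                j = i + 1
                while j < len(pairs) and pairs[j][0] == pairs[i][0]:
                    j += 1
                if j - i > 1:                   # digest collision: verify the lines
                    seen = set()
                    for _, off in pairs[i:j]:
                        f.seek(off)
                        line = f.readline()
                        if line in seen:
                            duplicates.append(off)
                        else:
                            seen.add(line)
                i = j
        return duplicates

The point being that the 16-byte digests and offsets can fit in memory
(or be sorted externally) even when the 15 million records themselves
don't, but the final verification still needs random access to the
original data, as noted above.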

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



