removing duplication from a huge list.

Adam Olsen rhamph at gmail.com
Tue Mar 3 16:09:40 EST 2009


On Feb 27, 9:55 am, Falcolas <garri... at gmail.com> wrote:
> If order did matter, and the list itself couldn't be stored in memory,
> I would personally do some sort of hash of each item (or something as
> simple as first 5 bytes, last 5 bytes and length), keeping a reference
> to which item the hash belongs to, sort and identify duplicates in the
> hashes, and use the reference to check whether the actual items in
> question match as well.
>
> Pretty brutish and slow, but it's the first algorithm which comes to
> mind. Of course, I'm assuming that the list items are long enough to
> warrant using a hash and not the values themselves.
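The quoted approach could be sketched roughly as follows, assuming the "huge list" lives in a text file with one item per line. `make_sig` is a hypothetical helper implementing the (first 5 bytes, last 5 bytes, length) signature mentioned above; only signatures and byte offsets are held in memory, and colliding signatures are resolved by seeking back into the file to compare the full items. Order of first occurrences is preserved.

```python
def make_sig(item):
    # Cheap stand-in for a hash: first 5 bytes, last 5 bytes, and length.
    return (item[:5], item[-5:], len(item))

def dedup_lines(path, out_path):
    seen = {}  # signature -> byte offsets of items already written
    with open(path, "rb") as f, open(out_path, "wb") as out:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            item = line.rstrip(b"\n")
            sig = make_sig(item)
            duplicate = False
            for prev_offset in seen.get(sig, []):
                # Signatures collide; seek back and compare the real items.
                here = f.tell()
                f.seek(prev_offset)
                prev_item = f.readline().rstrip(b"\n")
                f.seek(here)
                if prev_item == item:
                    duplicate = True
                    break
            if not duplicate:
                seen.setdefault(sig, []).append(offset)
                out.write(item + b"\n")
```

As the poster notes, this is brutish: every signature collision costs two seeks, so it only pays off when items are long enough that keeping full values in memory is worse.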

Might as well move all of the duplicate checking into sqlite.

Although it seems tempting to stick a filtering layer in front of it,
you will always need either a full comparison or a full update for each
item, so there's no potential for a fast path.
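A minimal sketch of handing the duplicate checking to sqlite, along the lines suggested above (the function name and schema are my own invention): a UNIQUE constraint rejects repeats, INSERT OR IGNORE silently skips them, and ordering by rowid returns the first occurrences in their original order. Passing a file path instead of ":memory:" lets the table spill to disk when the list is too big for RAM.

```python
import sqlite3

def dedup_with_sqlite(items, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    # UNIQUE makes sqlite do the duplicate comparisons for us.
    conn.execute("CREATE TABLE IF NOT EXISTS items (value TEXT UNIQUE)")
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO items (value) VALUES (?)",
            ((item,) for item in items),
        )
    # rowid increases in insertion order, so this preserves the
    # original order of first occurrences.
    result = [row[0] for row in
              conn.execute("SELECT value FROM items ORDER BY rowid")]
    conn.close()
    return result
```

Example: dedup_with_sqlite(["a", "b", "a", "c", "b"]) returns ["a", "b", "c"].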



More information about the Python-list mailing list