[Spambayes] deleting "duplicate" spam before training? good idea or bad?

Skip Montanaro skip@pobox.com
Mon, 9 Sep 2002 11:31:12 -0500


Because I get mail through several different email addresses, I frequently
get duplicates (or triplicates or more-plicates) of various spam messages.
In saving spam for later analysis I haven't always been careful to avoid
saving such duplicates.

I wrote a script some time ago to try to minimize the duplicates I see by
calculating a loose checksum, but I still have some duplicates.  Should I
delete the duplicates before training or not?  Would people be interested in
the script?  I'd be happy to extricate it from my local modules and check it
into CVS.
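
For anyone unfamiliar with the idea, a loose checksum could look roughly
like the sketch below.  This is an illustration only, not the script
mentioned above: the function names (loose_checksum, drop_duplicates) and
the normalization rules (ignore headers, lowercase, drop digits, collapse
whitespace) are assumptions chosen for the example.

```python
import email
import hashlib
import re


def loose_checksum(raw_message):
    """Return a hex digest that ignores headers and trivial body differences.

    Hypothetical helper: the normalization steps are assumptions, not the
    rules used by any existing script.
    """
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        # Only text parts contribute; multipart containers are skipped.
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            parts.append(payload.decode("latin-1"))
    body = "\n".join(parts).lower()
    body = re.sub(r"\d+", "", body)            # drop digits (serial numbers, dates)
    body = re.sub(r"\s+", " ", body).strip()   # collapse runs of whitespace
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def drop_duplicates(raw_messages):
    """Keep the first message seen for each loose checksum, in order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        digest = loose_checksum(raw)
        if digest not in seen:
            seen.add(digest)
            unique.append(raw)
    return unique
```

Because the headers are ignored, two copies of a message sent to different
addresses hash the same.  Spam that randomizes words in the body would
still slip through, which may be why some duplicates remain.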

Skip