[Spambayes] Re: [SAtalk] spampot -- spam honeypot server (fwd)
Justin Mason
jm at jmason.org
Tue Jan 21 11:20:53 EST 2003
Matt Sergeant said:
> My guess is you'd need to put some sort of Razor-like signature
> checking in place (perhaps using Pyzor) to remove dupes.
Actually, I have some rough-but-working-well-enough perl code in
SpamAssassin CVS, in the "masses/corpora" dir, which does this.
"fuzzy-hash-maildir" is the script in question. Here's how it works:
- for each mail:
  - strip all HTML tags
  - strip text in "quotes" -- vars in javascript, etc.
  - remove words with ? marks inside them, possible encoded mail addrs
  - remove words with @ marks inside them, possible encoded mail addrs
  - remove lines that contain just a single string of non-white chars,
    possible hash busters or encoded mail addrs
  - split into an array of lines (NOT bytes, since spammers are using
    variable-length hash-busting strings)
  - divide into 4 blocks and hash them: hash1, hash2, hash3, hash4
  - output into associative arrays as
      hash1.hash2 -> filename
      hash1.hash2.hash3 -> filename
      hash1.hash2.hash3.hash4 -> filename
    (should probably use e.g. hash2.hash3.hash4 as well. Note that
    hashbusters and encoded addrs generally appear in the first and/or
    last blocks.)
- finally check those arrays for collisions and output these as "likely
  dups".
It works sufficiently well. ;)
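For anyone who'd rather read code than bullets, the steps above can be
sketched in Python roughly like this. It is not the fuzzy-hash-maildir
script itself (that one is Perl), just a minimal illustration. The
function names, the choice of SHA-256, cutting the blocks evenly by line
count, dropping lines that end up empty, and skipping mails with fewer
than 4 surviving lines are all assumptions made for the sketch.

```python
import hashlib
import re
from collections import defaultdict


def normalize(body):
    """Apply the stripping steps and return the surviving lines."""
    text = re.sub(r"<[^>]*>", "", body)    # strip all HTML tags
    text = re.sub(r'"[^"]*"', "", text)    # strip text in "quotes"
    lines = []
    for line in text.splitlines():
        # drop words containing ? or @ (possible encoded mail addrs)
        words = [w for w in line.split() if "?" not in w and "@" not in w]
        # drop lines that are a single run of non-white chars
        # (assumption: lines left empty are dropped as well)
        if len(words) <= 1:
            continue
        lines.append(" ".join(words))
    return lines


def block_hashes(lines, nblocks=4):
    """Divide the lines into nblocks contiguous blocks and hash each."""
    n = len(lines)
    hashes = []
    for i in range(nblocks):
        block = lines[i * n // nblocks:(i + 1) * n // nblocks]
        data = "\n".join(block).encode("utf-8")
        hashes.append(hashlib.sha256(data).hexdigest())  # hash choice is an assumption
    return hashes


def likely_dups(mails, nblocks=4):
    """mails maps filename -> body text; returns sorted groups of likely dups."""
    index = defaultdict(set)
    for name, body in mails.items():
        lines = normalize(body)
        if len(lines) < nblocks:
            # assumption: too short to split, empty blocks would all collide
            continue
        h = block_hashes(lines, nblocks)
        # keys: hash1.hash2, hash1.hash2.hash3, hash1.hash2.hash3.hash4
        for k in range(2, nblocks + 1):
            index[".".join(h[:k])].add(name)
    groups = {frozenset(names) for names in index.values() if len(names) > 1}
    return sorted(sorted(g) for g in groups)
```

Since the keys are built from leading blocks only, two mails that differ
just in the last block still collide on hash1.hash2 and
hash1.hash2.hash3; a difference in the first block would need the
hash2.hash3.hash4 key mentioned above.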
--j.