Graham's spam filter
Christopher Browne
cbbrowne at acm.org
Thu Aug 22 21:53:38 EDT 2002
In an attempt to throw the authorities off his trail, Paul Rubin <phr-n2002b at NOSPAMnightsong.com> transmitted:
> Neale Pickett <neale at woozle.org> writes:
>> One thing you *should* do, though, is skip base64-encoded stuff. That
>> will just clutter up your database.
> You can't skip base64-encoded stuff since a lot of it is spam. You
> have to decode it and filter it.
Ah, but the fact that there's a chunk of base64-encoded material is a
piece of data. Create a 'base64' token, and count it like any other
word. Works like a charm. (Throw the encoded body away, and you're left
with little more than header data, which is also Statistically Highly
Significant, and counting that _also_ works like a charm.)
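A minimal sketch of the marker idea in Python, using the standard
`email` module. The function name `tokenize`, the `TOKEN_RE` pattern,
and the "header-name:word" prefix for header tokens are illustrative
assumptions, not anything from Graham's filter; the only part taken
from the discussion above is "see a base64 part, count one 'base64'
marker, don't tokenize the encoded body".

```python
import email
import re
from collections import Counter

# Hypothetical token pattern, chosen only for illustration.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")
BASE64_MARKER = "base64"  # the synthetic token that stands in for an encoded part


def tokenize(raw_message):
    """Count tokens in a message, replacing base64 bodies with a marker."""
    msg = email.message_from_string(raw_message)
    tokens = Counter()

    # Top-level headers: prefix each word with the header name so that a
    # header word can never collide with the synthetic marker.
    for name, value in msg.items():
        for word in TOKEN_RE.findall(str(value)):
            tokens[f"{name.lower()}:{word.lower()}"] += 1

    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        encoding = part.get("Content-Transfer-Encoding", "").strip().lower()
        if encoding == "base64":
            # Record that an encoded chunk exists; skip its contents.
            tokens[BASE64_MARKER] += 1
            continue
        payload = part.get_payload()
        if isinstance(payload, str):
            tokens.update(w.lower() for w in TOKEN_RE.findall(payload))
    return tokens
```

The marker then gets trained and scored like any other word, so the
database holds one entry for "there was an encoded part" instead of
thousands of meaningless base64 fragments.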
There's lots about this that _isn't_ intuitively obvious unless you
think very carefully about the math...
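For one piece of that math, here is a sketch of the per-token
probability from Graham's "A Plan for Spam", transcribed into Python.
The function name and the counts in the usage note are made up for
illustration. The formula itself (good counts doubled, tokens seen
fewer than five times ignored, result clamped to 0.01..0.99) is the
one from Graham's essay.

```python
def token_spam_probability(good_count, bad_count, ngood, nbad):
    """Per-token spam probability, after Graham's 'A Plan for Spam'.

    good_count / bad_count: occurrences of the token in good / spam mail.
    ngood / nbad: number of good / spam messages in the corpus.
    Returns None when the token is too rare to be trusted.
    """
    g = 2 * good_count  # good occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))
```

With hypothetical counts where the 'base64' marker turns up in 120 of
1000 spam messages and 3 of 1000 good ones,
token_spam_probability(3, 120, 1000, 1000) comes out around 0.95. A
single marker token can carry that much evidence without the filter
ever looking inside the encoded body.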
--
(concatenate 'string "aa454" "@freenet.carleton.ca")
http://www.ntlug.org/~cbbrowne/sgml.html
"Fear leads to anger. Anger leads to hate. Hate leads to using Windows
NT for mission-critical applications." --- What Yoda *meant* to say