Graham's spam filter

Christopher Browne cbbrowne at acm.org
Thu Aug 22 21:53:38 EDT 2002


In an attempt to throw the authorities off his trail, Paul Rubin <phr-n2002b at NOSPAMnightsong.com> transmitted:
> Neale Pickett <neale at woozle.org> writes:
>> One thing you *should* do, though, is skip base64-encoded stuff.  That
>> will just clutter up your database.

> You can't skip base64-encoded stuff since a lot of it is spam.  You
> have to decode it and filter it.

Ah, but the fact that there's a chunk of base64-encoded material is
itself a piece of data.  Create a 'base64' element, and count it.  Works
like a charm.  (Throw the encoded body away, and you're left with little
more than header data, which is also Statistically Highly Significant,
and which _also_ works like a charm.)
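
For the curious, here is a rough sketch of what "create a 'base64'
element, and count it" might look like.  This is only an illustration,
not anyone's actual filter: the function name tokenize, the word regex,
and the "header:word" token format are all made up for the example.  It
uses the standard email module to walk the message parts.

```python
import email
import re
from collections import Counter

# Hypothetical word pattern; a real filter would choose its own.
WORD = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(raw_message):
    """Return a Counter of tokens for one raw message (illustrative only)."""
    msg = email.message_from_string(raw_message)
    tokens = Counter()

    # Top-level header words are counted, tagged with the header name.
    for name, value in msg.items():
        for word in WORD.findall(str(value)):
            tokens["%s:%s" % (name.lower(), word.lower())] += 1

    for part in msg.walk():
        if part.is_multipart():
            continue
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if encoding == "base64":
            # Don't tokenize the encoded text; just record that it was there.
            tokens["base64"] += 1
            continue
        payload = part.get_payload()
        if isinstance(payload, str):
            for word in WORD.findall(payload):
                tokens[word.lower()] += 1

    return tokens
```

The resulting counts would then feed into whatever per-token statistics
the filter keeps, with 'base64' treated like any other word.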

There's lots about this that _isn't_ intuitively obvious unless you
think very carefully about the math...
-- 
(concatenate 'string "aa454" "@freenet.carleton.ca")
http://www.ntlug.org/~cbbrowne/sgml.html
"Fear leads to anger. Anger leads to hate. Hate leads to using Windows
NT for mission-critical applications."  --- What Yoda *meant* to say


