Graham's spam filter

Thu Sep 5 11:33:24 EDT 2002

Centuries ago, Nostradamus foresaw when Erik Max Francis <max at alcyone.com> would write:
> Aaron Swartz wrote:
>
>> I've been using bogofilter[1], Eric Raymond's Graham-derived spam
>> filter, which threw away base64-encoded data, and 90% of all spam that
>> got past the filter was base64-encoded. Therefore, I think that base64
>> content really needs to be decoded. I wrote a base64-decoding filter
>> in Python for it and the problem has gone away.
>
> Indeed.  I've been finding very much the same thing with my rule-based
> filter; about 90% of the spam that's getting through is base64 encoded. 
> I haven't yet taken the next step of automatically decoding the base64
> text parts (and then just processing that), but as you have discovered
> it is an obvious solution to the obvious problem.

Have you considered simply replacing strings that appear to be
base64-encoded with a token like "base64-text"?

That way the filter's database at least learns that spam commonly
contains base64 data, even without decoding it.
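
A rough sketch of that replacement step, in Python since that is what
the decoding filter mentioned above was written in. The regex, the
40-character threshold, and the name replace_base64 are my own
assumptions for illustration, not anything bogofilter actually does:

```python
import re

# Assumption: a "base64-looking" line is 40 or more characters drawn only
# from the base64 alphabet, optionally ending in '=' padding.
BASE64_LINE = re.compile(r'^[A-Za-z0-9+/]{40,}={0,2}$')

def replace_base64(body, token="base64-text"):
    """Collapse each run of base64-looking lines into a single token."""
    out = []
    in_run = False
    for line in body.splitlines():
        if BASE64_LINE.match(line.strip()):
            if not in_run:
                out.append(token)   # emit the token once per run
                in_run = True
        else:
            out.append(line)
            in_run = False
    return "\n".join(out)
```

A short final line of an encoded block can fall under the threshold and
slip through as ordinary text; for gathering statistics that is
probably tolerable.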

-> Supposing there is interesting text encoded inside the base64 stuff
   (such as source code for a virus), it _would_ be useful to decode
   it.

-> Supposing the base64 stuff is basically just a GIF/JPEG/PNG, or
   something else that doesn't contain "interesting text," you won't
   get much of value out of the decoding process.
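
One way to tell those two cases apart, if you do decode, is to check
whether the decoded bytes look like text. This is only a sketch: the
90% printable cutoff and the function name decoded_text_or_none are
assumptions I picked for illustration.

```python
import base64
import binascii

def decoded_text_or_none(b64_block, min_printable=0.9):
    """Return the decoded payload as text if it looks like text, else None."""
    try:
        raw = base64.b64decode(b64_block)
    except (binascii.Error, ValueError):
        return None                      # not valid base64 after all
    if not raw:
        return None
    # Count tab, LF, CR and printable ASCII bytes.
    printable = sum(1 for b in raw if b in (9, 10, 13) or 32 <= b < 127)
    if printable / len(raw) < min_printable:
        return None                      # probably an image or other binary
    return raw.decode("latin-1")
```

Anything that comes back as None could still be counted with a
placeholder token rather than dropped on the floor.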

Making the "tokenizing" step a tad smarter (e.g., recognizing "this
is likely base64" and collecting stats on the number of lines of
base64 material) requires minimal added effort, and I expect it would
buy you _most_ of the benefits of decoding.
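
Something along these lines is what I have in mind. It is a sketch
under my own assumptions: the word regex, the line-count buckets, and
the token names are made up for illustration and would need tuning
against real mail.

```python
import re

# Assumption: 40+ characters from the base64 alphabet means "likely base64".
BASE64_LINE = re.compile(r'^[A-Za-z0-9+/]{40,}={0,2}$')
WORD = re.compile(r"[A-Za-z0-9$'-]+")

def bucket(n):
    """Map a count of base64 lines to a coarse token (hypothetical buckets)."""
    if n < 10:
        return "base64-lines:1-9"
    if n < 100:
        return "base64-lines:10-99"
    return "base64-lines:100+"

def tokenize(body):
    """Split a message body into word tokens, summarizing base64 runs."""
    tokens = []
    run = 0                               # length of the current base64 run
    for line in body.splitlines():
        if BASE64_LINE.match(line.strip()):
            run += 1
            continue
        if run:
            tokens += ["base64-text", bucket(run)]
            run = 0
        tokens += WORD.findall(line)
    if run:                               # message ended inside a run
        tokens += ["base64-text", bucket(run)]
    return tokens
```

Bucketing the counts keeps the number of distinct tokens small, so the
database sees "a big base64 blob" often enough to score it.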
-- 
(concatenate 'string "chris" "@cbbrowne.com")
http://cbbrowne.com/info/linux.html
Let me control a planet's oxygen supply and I don't care who makes the
laws.