Graham's spam filter

Christopher Browne cbbrowne at acm.org
Thu Aug 22 21:53:37 EDT 2002


The world rejoiced as Erik Max Francis <max at alcyone.com> wrote:
> Neale Pickett wrote:
>
>> One thing you *should* do, though, is skip base64-encoded stuff.  That
>> will just clutter up your database.
>
> This is another example of where such a database won't work.  A fair
> number of (clever) spammers send the entire body of their message
> encoded with base64, so that none of it is readable text.  The spam
> filter either has to have some special rules for this case, or decode
> the base64 data and then proceed with that.  Otherwise, just by ignoring
> base64 data, an otherwise apparently innocuous message could easily get
> through the filter.

Two reasonable options present themselves that _don't_ involve any
decoding:

 a) Throw the Base64 data away.  

 You're left with headers and the stuff "around" the base64 material.
 If other spam messages look the same as this (and they do), it'll
 filter nicely without any special attention.

 This is what Ifile does, and the base64-encoded viruses head to
 Spam/Viruses with _great_ efficiency.

 b) Turn the Base64 into a "Base64 element."

 You're throwing away the content, but keeping the fact that there was
 a chunk of base64 content.  This is likely to be _slightly_ better
 than a).  After all, spam is likely to have a higher incidence of
 "47 lines of BASE64 content" than material from the Python
 discussion list.
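
For concreteness, here is a rough sketch of both options in Python, using
the standard email package.  It is not how Ifile does it; the function
name tokenize, the token regex, and the "*base64-lines-N*" marker format
are all made up for illustration.  The point is only that the payload is
never decoded: with keep_marker=False the base64 part is dropped (option
a), and with keep_marker=True it is replaced by a single token recording
how many lines of base64 there were (option b).

```python
import email
import re

# Hypothetical token pattern; any word-splitting rule would do here.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'_-]+")


def tokenize(raw_message, keep_marker=True):
    """Tokenize headers and non-base64 bodies without decoding base64.

    keep_marker=False -> option a): base64 payloads are thrown away.
    keep_marker=True  -> option b): each base64 payload becomes one
                         synthetic token carrying its line count.
    """
    msg = email.message_from_string(raw_message)
    tokens = []
    for part in msg.walk():
        # Headers of every MIME part are treated as ordinary text.
        for name, value in part.items():
            tokens.extend(TOKEN_RE.findall(f"{name} {value}"))
        if part.is_multipart():
            continue  # container parts have no body text of their own
        payload = part.get_payload()  # default decode=False: raw text
        encoding = str(part.get("Content-Transfer-Encoding", ""))
        if encoding.strip().lower() == "base64":
            if keep_marker:
                # Assumed marker format, e.g. "*base64-lines-47*".
                lines = len(payload.strip().splitlines())
                tokens.append(f"*base64-lines-{lines}*")
            continue  # never tokenize the encoded payload itself
        tokens.extend(TOKEN_RE.findall(payload))
    return tokens
```

Using the exact line count as the token is the crudest choice; bucketing
the count (say, small/medium/large) would probably generalize better,
but that is a tuning question for whoever trains the filter.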
-- 
(concatenate 'string "cbbrowne" "@cbbrowne.com")
http://cbbrowne.com/info/ifilter.html
There are many intelligent species in the universe.  They are all
owned by cats.


