Graham's spam filter

Thu Aug 22 23:54:33 EDT 2002

Quoth Paul Rubin <phr-n2002b at NOSPAMnightsong.com>:
> Christopher Browne <cbbrowne at acm.org> writes:
>> > You can't skip base64-encoded stuff since a lot of it is spam.  You
>> > have to decode it and filter it.
>> 
>> Ah, but the fact that there's a chunk of base64-encoded material is a
>> piece of data.  Create a 'base64' element, and count it.  Works like a
>> charm.  (Throw it away, and you're left with little more than header
>> data, which is also Statistically Highly Significant, which _also_
>> works like a charm.)
>> 
>> There's lots about this that _isn't_ intuitively obvious unless you
>> think very carefully about the math...
>
> I don't understand this.  If you can classify spam based on just the
> headers, there'd be no point to filtering the content, so we
> wouldn't be talking about text corpora.  You have to filter on content
> as well.
>
> And if you're going to filter content, you have to realize some
> messages will be base64-encoded, and of those base64 messages, some
> will be spam and others will be non-spam.  The idea of a spam filter
> is to figure out which are which.  It can't do that without decoding
> and examining them.

Killer question: Do you get non-spam where the content is all
base64-encoded?

The only mail _I_ get like that is when my brother is emailing out
baby pictures, and guess what?  The presence of headers that have
Dave's "fingerprints" all over them is enough to indicate that it's
"good mail," from him.

And I also have a folder full of "virus" messages, mostly
base64-encoded, and it classifies _very_ nicely on nothing more than
the headers, possibly some MIME information, and essentially _nothing_
kept from the body.

Remember, the hope is that a message consisting of nothing more than
the following should be identified as spam:

  From: someone at spammers.com
  To: cbbrowne at hex.net
  Subject: Important News

  Please look at my web site at http://spam.me.silly/ 

Something looking like that is almost certainly spam.  The equivalent
message that has a huge HTML page with a "hidden" barrel of ECMAScript
is also spam, and if you chop out the base64 stuff, you'll pretty much
have the above "clearly spam" message.

The fact that it _doesn't_ contain words commonly used when people
want to _communicate with me_ means that it looks like spam.  If there
were some base64 material attached, lopping it off doesn't change
that.

Indeed, if we replaced the base64 material with the string "base64",
thus indicating that there was _something_ there, that is also likely
quite illuminating, statistically, irrespective of whether or not you
try to further analyze the contents.
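
A rough sketch of that substitution, in Python with the standard
"email" module.  The function name, the choice of headers, and the
"header:word" token format are my own assumptions for illustration,
not anything taken from Graham's filter:

```python
from email import message_from_string


def tokenize(raw_message):
    """Turn a raw RFC 822 message into a flat list of tokens.

    Base64-encoded parts are NOT decoded; each one contributes the
    single marker token "base64" instead of its contents.
    """
    msg = message_from_string(raw_message)
    tokens = []

    # Header words are tagged with the header name, so that "news" in a
    # Subject line is counted separately from "news" in the body.
    for name in ("From", "To", "Subject"):
        value = msg.get(name, "")
        tokens.extend(f"{name.lower()}:{word.lower()}" for word in value.split())

    # walk() yields the message itself, then any nested MIME parts.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no text of their own
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if encoding == "base64":
            tokens.append("base64")  # record that something was there
        else:
            payload = part.get_payload()
            if isinstance(payload, str):
                tokens.extend(payload.lower().split())
    return tokens
```

Feed it a message whose only body is a base64 blob and what comes back
is the header tokens plus one "base64" marker, which is the "little
more than header data" case described above.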

It's just like doing "traffic analysis."  If a lot of encrypted
personal messages are going from staff officers to their families at a
particular time, it is likely that a military operation is getting
under way.  This is called the "underwear effect." [1]  

I'm _fine_ with using the "underwear effect" to detect spam.  It's
certainly good enough to provide decent correlation.
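
To make "decent correlation" concrete, here is a simplified sketch of
per-token scoring and combination in the style of a naive Bayesian
filter.  It is not a faithful copy of Graham's formulas: the function
names, the clamping bounds, and the 0.4 default for unseen tokens are
assumptions chosen for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate how strongly one token (e.g. "base64") indicates spam.

    spam_hits / ham_hits: number of spam / non-spam messages containing
    the token; n_spam / n_ham: sizes of the two corpora.
    """
    spam_rate = min(1.0, spam_hits / n_spam)
    ham_rate = min(1.0, ham_hits / n_ham)
    if spam_rate + ham_rate == 0:
        return 0.4  # assumed default for a token never seen before
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that no single token can be treated as absolute proof.
    return max(0.01, min(0.99, p))


def combine(probabilities):
    """Combine per-token probabilities, treating tokens as independent."""
    spam = 1.0
    ham = 1.0
    for p in probabilities:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)
```

With made-up counts where "base64" shows up in 90 of 100 spam messages
and 2 of 100 good ones, the marker alone scores well above 0.9, and the
header tokens then push the combined figure one way or the other.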

And if the base64 content turns out to be a JPEG, what was the
pattern you were planning to draw out of that anyway?  The _most_
that you could expect to get out of it is to know what kind of
file it is.  And the MIME headers, which you can read without decoding
anything, most likely already throw that into the corpus.
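
As a sketch of what the MIME headers give you for free, the following
pulls the declared type and the attachment's file extension out of each
part without touching the encoded body.  The function name and the
"content-type:" / "ext:" token prefixes are hypothetical, picked only
for this example:

```python
from email import message_from_string


def mime_tokens(raw_message):
    """Collect file-type tokens from MIME headers; bodies stay undecoded."""
    msg = message_from_string(raw_message)
    tokens = []
    for part in msg.walk():
        # get_content_type() returns a lowercased "maintype/subtype" and
        # falls back to "text/plain" when the header is missing.
        tokens.append("content-type:" + part.get_content_type())

        # get_filename() checks Content-Disposition first, then the
        # "name" parameter of Content-Type.
        filename = part.get_filename()
        if filename and "." in filename:
            tokens.append("ext:" + filename.rsplit(".", 1)[1].lower())
    return tokens
```

A JPEG attachment then shows up as "content-type:image/jpeg" and
"ext:jpg", which is about all the decoded bytes would have told a
word-counting filter anyway.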

[1] See _Decrypted Secrets_, F.L. Bauer, section 11.1.4, page 200.
-- 
(reverse (concatenate 'string "moc.enworbbc@" "sirhc"))
http://www3.sympatico.ca/cbbrowne/rdbms.html
"And  1.1.81 is  officially BugFree(tm),  so  if you  receive any  bug
reports on it, you know they are just evil lies." -- Linus Torvalds