Graham's spam filter
Christopher Browne
cbbrowne at acm.org
Thu Sep 5 16:24:18 EDT 2002
Quoth Erik Max Francis <max at alcyone.com>:
> Christopher Browne wrote:
>> Have you considered simply replacing strings that appear to be
>> base64-encoded with a token like "base64-text"?
>>
>> That allows the database to at least be aware that the spam
>> commonly contains base64 data.
>
> Well, that really depends on what your goal is. Again, if you're
> one of those people that has a very tight circle of email buddies
> and so essentially any unsolicited email is by definition spam, then
> you can tighten down your spam filter in all kinds of very powerful
> ways.
>
> I, as I've mentioned before, receive unsolicited email from my Web
> sites and various projects, and so unfortunately don't have the
> luxury of doing this. So I need to support receiving email from
> faraway lands and unknown email addresses as well as trying to
> vigorously filter spam.
>
> Fact is, unfortunately, lots of people send legitimate email that is
> MIME encoded.

No, I am certainly _not_ defining "all unsolicited email" as being
spam. Quite to the contrary, I receive quite a lot of interesting
email from unexpected sources. Very little of it, statistically
speaking, is heavily MIME encoded, mind you...

The MIME encoded stuff does _not_ solely consist of "base64" text; it
also has header information that at least _suggests_ file type info.

A recent virus email contained:

  Content-Type: application/octet-stream;
          name=snoopy.exe
  Content-Transfer-Encoding: base64
  Content-ID: <UCuk0QbULj2h8t9F>

another had:

  Content-Type: audio/x-wav;
          name=bgcolor.exe
  Content-Transfer-Encoding: base64
  Content-ID: <W0H1pml7>

I get legitimate mail that contains base64 material; it _never_, in my
experience, consists solely of base64 material.

It always contains _some_ sort of commentary, and whether that
commentary came as text or as HTML, it's quite nicely sufficient to
distinguish the "unexpected resumes from Russia" from the email
viruses.
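
The header lines quoted above can be turned into tokens without
touching the encoded payload at all. The following is only a rough
sketch: the function name header_tokens and the token spellings
("content-type:", "name-ext:") are made up for illustration, and it
assumes the headers of one MIME part are already available as a
string.

```python
from email import message_from_string
from email.utils import collapse_rfc2231_value


def header_tokens(part_text):
    """Turn the headers of one MIME part into filter tokens.

    The token names used here are hypothetical; any consistent
    spelling would do for a statistical filter.
    """
    msg = message_from_string(part_text)
    tokens = ["content-type:" + msg.get_content_type()]

    encoding = msg.get("Content-Transfer-Encoding")
    if encoding:
        tokens.append("content-transfer-encoding:" + encoding.strip().lower())

    # The "name" parameter of Content-Type suggests a file type, e.g.
    # an .exe that claims to be audio/x-wav.
    name = msg.get_param("name")
    if name:
        name = collapse_rfc2231_value(name)
        if "." in name:
            tokens.append("name-ext:" + name.rsplit(".", 1)[1].lower())
    return tokens
```

Fed the second example above, this yields a content-type token for
audio/x-wav alongside a name-ext token for exe, and the filter's
statistics can then decide whether that pairing means anything.
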
>> -> Supposing there is interesting text encoded (such as source code
>> for a virus) inside the base64 stuff, it _would_ be useful to
>> decode it;
>>
>> -> Supposing the base64 stuff is basically just a GIF/JPEG/PNG, or
>> something else that doesn't contain "interesting text," you'll
>> have not much of value from the decoding process.
>>
>> Making the "tokenizing" step a tad smarter (e.g. - recognizing "this
>> is likely base 64" and collecting stats on numbers of lines of base64
>> material) requires minimal added effort, and I expect it would buy you
>> _most_ of the benefits of decoding.
>
> Spammers are hitting upon the strategy, though, of sending emails in
> which the body consists of nothing but a completely encoded base64
> MIME part. So in that case, the entire body of your message would
> consist solely of your "base64encoded" token. So in the general
> case of any kind of spam filter (not just limited to a Graham
> filter), it's questionable how useful this will be, unless you plan
> to always filter against that token, presuming it to always indicate
> spam.

I've been using naive Bayesian filtering for years; I don't assume
that _any_ particular token indicates _any_ particular result.

I'm not interested in the "rule-based" stuff, only in the schemes
based on statistical analysis.

And the body of the message would most certainly NOT consist solely of
a "base64encoded" token.

The body portion would consist of:

 - Various "Content-foo" tokens

 - The header information that these documents _do_ contain; they
   normally contain an HTML header.

 - Not "solely a base64encoded token," but rather some sort of count
   involving _many_ base64encoded tokens.

The notion that it's "solely one token" is in your imagination, not in
reality. There are _no_ "presumptions" being made here.

What I'm saying, that apparently isn't being read, is that I expect
that collecting stats on the numbers of "base64 lines" is likely to be
_nearly_ as useful as decoding the contents, and that it's _certainly_
simpler and faster.
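
As a rough sketch of the line-counting idea: the regular expression,
the 40-character threshold, the power-of-two bucketing, and the token
name "base64-lines>=N" are all assumptions made for illustration, not
part of any existing filter.

```python
import re

# Assumption: a "base64 line" is a run of at least 40 characters from
# the base64 alphabet, optionally followed by '=' padding.  A short
# final line of an encoded block will not match and is kept as a word.
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{40,}={0,2}$")


def tokenize(body):
    """Split a body into word tokens, summarising base64-looking lines."""
    tokens = []
    base64_lines = 0
    for line in body.splitlines():
        stripped = line.strip()
        if BASE64_LINE.match(stripped):
            base64_lines += 1      # count the line, discard its content
            continue
        tokens.extend(stripped.split())
    if base64_lines:
        # Round the count down to a power of two so that messages with
        # similar amounts of encoded data share one token.
        bucket = 1 << (base64_lines.bit_length() - 1)
        tokens.append("base64-lines>=%d" % bucket)
    return tokens
```

Nothing is decoded here; the encoded lines cost one regular
expression match each, and the "Content-foo" headers and any plain
commentary still come through as ordinary tokens.
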

If it _proves_ insufficient as a discriminator (please feel free to
direct any nonsense about 'presuming anything to always indicate
spam' to /dev/null), then it might prove necessary to _try_ to analyze
the contents.

_Trying_ to decode and analyze the contents may still prove a futile
exercise. You won't get much useful material out of such common MIME
contents as graphics, PDFs, ZIP files, and audio files, without going
to even _more_ gratuitous lengths to analyze them, which might very
well make you vulnerable to DoS attacks directed against the mail
filter itself.
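
If decoding does turn out to be needed, one cautious way to go about
it is to decode only parts that claim to be text and to refuse
anything over a fixed size. This is a sketch under assumptions: the
function name decode_if_textual, the 64 KB cap, and the "text/"
prefix test are all hypothetical choices, not a recommendation from
any particular filter.

```python
import base64
import binascii

# Hypothetical cap on how much encoded data the filter will look at.
MAX_ENCODED_BYTES = 64 * 1024


def decode_if_textual(content_type, payload):
    """Decode a base64 payload only when it is small and claims to be text.

    Returns the decoded text, or None when the part is skipped.
    """
    if not content_type.lower().startswith("text/"):
        return None                # graphics, ZIP files, audio, ...
    if len(payload) > MAX_ENCODED_BYTES:
        return None                # bound the work done per message
    try:
        raw = base64.b64decode(payload)
    except (binascii.Error, ValueError):
        return None                # malformed data: nothing to analyze
    # latin-1 maps every byte to a character, so this cannot fail.
    return raw.decode("latin-1")
```

Skipped parts can still contribute their header tokens and line
counts, so nothing is lost relative to the simpler approach.
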
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/nonrdbms.html
"I will not send lard through the mail" ^ 100 -- Bart Simpson