[Spambayes] FW: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58

Tim Peters tim.one@comcast.net
Thu Oct 31 16:40:14 2002


[Tim]
>> A new mini-phase of body tokenization scours HTML for common
>> virus clues, variations of
>>
>>     <script    </script
>>     <iframe    </iframe
>>     src=cid:
>>     height=0   width=0

[Guido]
> This gets us awfully close to SA's "precompiled list of clues to look
> for" approach. :-(

We're throwing away *all* HTML tags now, and missing a lot of info because
of that.  As I said about this one, virus/worm msgs of this nature often
have no other content, period.  The classifier can't score what it can't see.

Feel free to design a principled approach to tokenizing HTML tags that still
allows some HTML messages to avoid getting called spam.  In the absence of
that, I've got no qualms about adding special cases that help.  For goodness'
sake, it was a massive special-case hack to *strip* HTML tags to begin
with -- think of this as a minor unhack of that <wink>.