[Spambayes] idea for tokenizer.crack_filename change

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 00:46:00 -0400


[Skip Montanaro]
>>> It seems to me that base64-encoded, all DOS/Windows executables start
>>> with (reciting from memory, since I've deleted all viruses and
>>> haven't received any new ones in the last 15 minutes or so) "TPqAAA"
>>> or something similar.  Why rely on finding specific file extensions?
>>> They can just change.

[Tim]
>> Well, not often, and the scheme we're working on is supposed to be
>> able to learn when they do <wink>.  Would you like to write some
>> code to tokenize this particular bit of Windows Lore?
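For concreteness, here is a rough sketch of what such a tokenizer hook might look like. It is not Skip's patch and not the project's code: the function name, the token string, and the 64-character look-ahead are all made up for illustration, and it is written for current Python 3. It relies on one solid fact: DOS/Windows executables begin with the two bytes "MZ", so their base64 encoding begins with "TV". Rather than hard-coding the encoded prefix, the sketch decodes the first base64 quantum and checks the raw bytes.

```python
import base64
import binascii

# Hypothetical token name, chosen only for this sketch.
EXE_TOKEN = "attachment:dos-exe-header"


def crack_exe_header(b64_text):
    """Yield EXE_TOKEN if a base64 payload decodes to data starting with 'MZ'."""
    # Look only at the start of the payload; drop line breaks and spaces,
    # then keep one full base64 quantum (4 characters -> up to 3 bytes).
    head = "".join(b64_text[:64].split())[:4]
    if len(head) < 4:
        return
    try:
        raw = base64.b64decode(head, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 (or not ASCII): produce no token.
        return
    if raw[:2] == b"MZ":
        yield EXE_TOKEN


# Example: a fake payload that starts with the usual executable header bytes.
sample = base64.b64encode(b"MZ\x90\x00\x03\x00\x00\x00").decode("ascii")
tokens = list(crack_exe_header(sample))
```

Decoding before testing means the check does not depend on anyone's memory of the exact encoded prefix, and it still works if the payload is wrapped across lines.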

[Skip]
> I gave it a try, but I'm still suffering with fp/fn rates around 15%,

Yuck.  Weren't you going to share some detailed output?  Something isn't
right, but you've already guessed that <wink>.

> so anything I see is suspect.  Also, I saw no change.  It's quite
> possible I have a bug, but I've also cleaned out obvious viruses
> from my corpora.

Then is there any reason to hope that you *would* see a change?

> True spam may have enough indicators elsewhere that this scheme
> won't help.

Note that no single indicator carries much weight under this scheme (whether
Paul's or Gary's doesn't matter): one clue is just one clue.  I expect that
makes a rule-based scheme (like Greg's) more effective for virus detection,
where a "smoking gun" is more like a "smoking hydrogen bomb".
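To make "one clue is just one clue" concrete, here is a small sketch of Graham-style combining, using the formula from "A Plan for Spam". It is an illustration only, not the project's classifier; the function name and the clue probabilities are invented, and every probability is assumed to lie strictly between 0 and 1.

```python
from math import prod


def graham_combine(probs):
    """Combine per-clue spam probabilities, Graham style.

    Assumes every value is strictly between 0 and 1.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# One very strong spam clue (say, an executable-header token) next to
# three moderately hammy clues.  All four values are made up.
clues = [0.99, 0.1, 0.1, 0.1]
score = graham_combine(clues)
```

With these made-up numbers the combined score comes out well under 0.5, so the one strong clue is outvoted by a few ordinary ham clues.  A rule-based filter would instead treat that clue as decisive on its own.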

> Should I just go ahead and check in my change (it is controlled by a
> couple of new options, and by default is not enabled) and let y'all point
> out my bugs?

Sure!  I'll be happy to run my tests, but note that I can no longer measure
improvements, and I know it will have no effect on my f-n rate (I know all
of my f-n intimately now, and none have an executable attachment).