[Spambayes] idea for tokenizer.crack_filename change
Tim Peters
tim.one@comcast.net
Sun, 22 Sep 2002 00:46:00 -0400
[Skip Montanaro]
>>> It seems to me that base64-encoded, all DOS/Windows executables start
>>> with (reciting from memory, since I've deleted all viruses and
>>> haven't received any new ones in the last 15 minutes or so) "TPqAAA"
>>> or something similar. Why rely on finding specific file extensions?
>>> They can just change.
[Tim]
>> Well, not often, and the scheme we're working on is supposed to be
>> able to learn when they do <wink>. Would you like to write some
>> code to tokenize this particular bit of Windows Lore?
[Skip]
> I gave it a try, but I'm still suffering with fp/fn rates around 15%,
Yuck. Weren't you going to share some detailed output? Something isn't
right, but you've already guessed that <wink>.
> so anything I see is suspect. Also, I saw no change. It's quite
> possible I have a bug, but I've also cleaned out obvious viruses
> from my corpora.
Then is there any reason to hope that you *would* see a change?
> True spam may have enough indicators elsewhere that this scheme
> won't help.
Note that no single indicator carries much weight under this scheme (whether
Paul's or Gary's doesn't matter -- one clue is just one clue, and I expect
that makes a rule-based scheme (like Greg's) more effective for virus
detection, where a "smoking gun" is more like a "smoking hydrogen bomb").
> Should I just go ahead and check in my change (it is controlled by a
> couple new options, and by default is not enabled) and let y'all point
> out my bugs?
Sure! I'll be happy to run my tests, but note that I can no longer measure
improvements, and I know it will have no effect on my f-n rate (I know all
of my f-n intimately now, and none have an executable attachment).
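
For anyone who wants to play with the idea, here is a minimal sketch of the check Skip describes. It is not his actual change. It relies on DOS/Windows executables starting with the two bytes "MZ", and it decodes the first base64 quantum instead of matching a prefix recited from memory. The function name exe_attachment_tokens and the token string "attachment:dos-exe" are made up for illustration, and the code is written for a current Python 3.

```python
import base64

def exe_attachment_tokens(b64_payload):
    """Return a list with one hypothetical token if the base64 payload
    decodes to something starting with the DOS/Windows "MZ" signature."""
    # Drop line breaks and other whitespace, then keep the first four
    # base64 characters, which decode to the first three raw bytes.
    head = "".join(b64_payload.split())[:4]
    if len(head) < 4:
        return []
    try:
        raw = base64.b64decode(head)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return []
    # "attachment:dos-exe" is an invented token name, not a real
    # Spambayes token.
    if raw[:2] == b"MZ":
        return ["attachment:dos-exe"]
    return []
```

Since this yields a single clue, it would be weighed like any other token rather than acting as a rule.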