[Tracker-discuss] Some observations about the spam filter
skip at pobox.com
Sun Aug 24 20:02:56 CEST 2008
On August 11 I wrote:
me> I just worked my way through the current pile of SpamBayes messages.
me> There were actually a couple of spam messages. (At least I'm fairly certain
me> they were spam. They were in French, didn't appear to have anything
me> to do with Python and were in HTML format.)
me> A couple of things jumped out at me:
me> 1. It looks like synthetic tokens are being generated in both
me> detectors/spambayes.py and extensions/spambayes.py. They both
me> have somewhat different versions of an extract_classinfo()
me> function. Can we get away with a single version of that
me> function?
me> 2. Many messages mention a Subversion revision number. These are
me> almost always different. We should generate a synthetic token
me> which indicates whether or not a submission contained what looked
me> like a revision. I'll check something in for that shortly once I
me> understand how I should deal with item #1.
me> 3. If the body of the message was "My dog has fleas." it would be
me> presented to the spam filter as "content:My dog has fleas." That
me> is, the first word is always prefixed by the string "content:".
me> I can't tell where that's getting applied, but we should get rid
me> of it.
I've not seen a reply about this. I realize Martin is on holiday. Does
anyone else who has seen this note have an opinion? I created issue 215 with
a patch for detectors/spambayes.py to add a hasrev token:
http://psf.upfronthosting.co.za/roundup/meta/issue215
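
For anyone who hasn't opened the issue, the idea from item 2 can be
sketched roughly as below. This is only an illustration, not the patch
attached to issue 215: the regular expression, the function name
revision_token() and the "hasrev:yes"/"hasrev:no" token spelling are all
assumptions made for the example.

```python
import re

# Hypothetical pattern (an assumption, not taken from the real patch):
# matches things like "r65432", "rev 65432", "rev. 65432", "revision 65432".
# Requiring 3 to 6 digits avoids firing on short strings such as "r2".
_REV_RE = re.compile(r"\b(?:r|rev(?:ision)?\.?\s*)(\d{3,6})\b", re.IGNORECASE)


def revision_token(text):
    """Return one synthetic token saying whether text mentions a revision.

    The revision number itself is dropped on purpose: it is almost always
    different from message to message, so it carries no signal for the
    classifier, while the mere presence of a revision might.
    """
    if _REV_RE.search(text):
        return "hasrev:yes"
    return "hasrev:no"


if __name__ == "__main__":
    print(revision_token("Fixed in r65432."))        # hasrev:yes
    print(revision_token("See revision 65432."))     # hasrev:yes
    print(revision_token("My dog has fleas."))       # hasrev:no
```

The point of collapsing every revision number into a single yes/no token
is that the classifier sees the same token each time, instead of a stream
of numbers it has never seen before.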
Thx,
Skip