[Spambayes] Scoring spam discussions
Tim Peters
tim.one@comcast.net
Sat Oct 19 08:56:56 2002
Here's an interesting result: I've got a separate folder for my email
discussing this project. It currently contains 1,316 msgs, and *none* of
them have been trained on.
Just for fun, I scored them with my growing-but-still-small "Tim's email"
classifier. Result: almost all scored 0.00 under chi-combining, including
the ones you expected would score as spam <wink>.
Non-zero scores (score, number of msgs):

    0.01   7
    0.02   6
    0.03   1
    0.04   5
    ------------- 0.05 is a paranoid "I'm sure it's ham" chi cutoff
    0.06   3
    0.08   1
    0.09   1
    ------------- 0.10 is a conservative chi ham_cutoff
    0.11   2
    0.12   1
    0.13   1   pvt 30KB email from someone including a full listing of
               all their FP on a run
    ------------- 0.30 is a fine chi ham_cutoff on my c.l.py data
    0.40   1   strange brief note from an acm.org spam-filter developer
    0.66   1   A msg from me, from PythonLabs email discussions that
               took place before any code was written.
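
For anyone who wants to check the arithmetic, the table above can be
tallied against the three cutoffs it mentions. This is only a sketch: the
names `nonzero_scores`, `TOTAL_MSGS` and `count_below` are made up for
illustration and are not from the spambayes code, and it assumes every msg
not listed in the table scored exactly 0.00. No score in the table equals a
cutoff, so whether the comparison is strict doesn't matter here.

```python
# Histogram of non-zero scores, copied from the table: score -> msg count.
nonzero_scores = {
    0.01: 7, 0.02: 6, 0.03: 1, 0.04: 5,
    0.06: 3, 0.08: 1, 0.09: 1,
    0.11: 2, 0.12: 1, 0.13: 1,
    0.40: 1, 0.66: 1,
}
TOTAL_MSGS = 1316  # size of the folder, per the text


def count_below(cutoff):
    """Number of msgs scoring strictly below cutoff, counting the 0.00 ones."""
    # Assumption: every msg absent from the histogram scored 0.00.
    zeros = TOTAL_MSGS - sum(nonzero_scores.values())
    return zeros + sum(n for score, n in nonzero_scores.items()
                       if score < cutoff)


for cutoff in (0.05, 0.10, 0.30):
    n = count_below(cutoff)
    print("ham_cutoff %.2f: %d below, %d at or above"
          % (cutoff, n, TOTAL_MSGS - n))
```

Under those assumptions, only the 0.40 and 0.66 msgs land above the 0.30
cutoff.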
The 0.66 msg was a forwarded Asian spam, with a bunch of my comments
added, and its Subject line is:
Subject: [PythonLabs]
=?ks_c_5601-1987?B?Rlc6ICixpLDtKbDmwO+75yC48LTPxc24tSC8rbrxvbo=?=
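
The `=?...?=` part of that Subject is an RFC 2047 encoded word: a charset
name (`ks_c_5601-1987`, a Korean charset), a `B` for base64, and the
payload. A small sketch of pulling it apart with the standard library's
`email.header.decode_header`, as found in current Python 3. Treating the
bytes as `euc_kr` is my assumption about the closest Python codec, and
`errors="replace"` is there in case some bytes don't map.

```python
from email.header import decode_header

# The encoded word from the Subject line, without the "[PythonLabs]" prefix.
encoded_word = ("=?ks_c_5601-1987?B?"
                "Rlc6ICixpLDtKbDmwO+75yC48LTPxc24tSC8rbrxvbo=?=")

# decode_header returns a list of (payload, charset) pairs; an encoded
# word comes back as undecoded bytes plus the charset name it declared.
parts = decode_header(encoded_word)
raw, charset = parts[0]

# Assumption: euc_kr is the matching Python codec for ks_c_5601-1987.
text = raw.decode("euc_kr", errors="replace")

print(charset)
print(text)
```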
It turns out that the MIME structure in this msg is damaged, and the email
package gave up after parsing the headers. The high spam score (which is
nevertheless solidly in chi's middle ground, thanks to finding clues that I
sent this msg) was mostly due to all the gibberish in the Subject line
(Rob, avert your eyes <wink>):
    'subject:['            0.206009
    'subject:PythonLabs'   0.228589
    'subject:-'            0.356645
    'subject:?'            0.681345
    'subject:1987'         0.844828
    'subject:] =?'         0.844828
    'subject:ks_c_5601'    0.844828
    'subject:skip:7 20'    0.844828
    'subject:skip:R 10'    0.844828
    'subject:=?='          0.978469
    'subject:+'            0.980349
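
For readers who haven't followed the chi-combining threads, here is a rough
sketch of how a list of per-token spam probabilities like the one above
gets folded into a single score. It is written from the scheme as discussed
on this list, not copied from the classifier; `chi2Q`, `chi_combine` and
`subject_clues` are illustrative names, and a real version would want to
guard against underflow on long msgs. It feeds in only the Subject clues
listed above, so it will not reproduce the 0.66, which also reflected the
hammy clues found elsewhere in the headers.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair over 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Fold per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # S is the spam indicator, H the ham indicator.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0


# The Subject-line clues from the listing, in the same order.
subject_clues = [0.206009, 0.228589, 0.356645, 0.681345,
                 0.844828, 0.844828, 0.844828, 0.844828, 0.844828,
                 0.978469, 0.980349]

print("%.2f" % chi_combine(subject_clues))
```

Taken alone, these clues score well over 0.5, which fits the account above:
the Subject line pushed the score up, and the other header clues pulled it
back into the middle ground.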
If I repair the MIME by hand, so that it sees my comments (as well as the
forwarded spam), the chi score falls to 0.03. The forwarded spam in
isolation scores 1.00. My comments in isolation broke the Pentium's
underflow trap <wink>.
damn-this-stuff-works-good-ly y'rs - tim