[Spambayes] Results of playing with CDB
Neale Pickett
neale@woozle.org
16 Sep 2002 11:36:56 -0700
So then, Tim Peters <tim.one@comcast.net> is all like:
> > On the plus side, their inboxes are likely to be very jargon-free.
>
> The lack of jargon is likely to hurt more than help -- the classifier
> gets as much good out of finding "good words" as "bad" ones, but the
> set of good words likely varies across users.
A jargon-free mailbox or two seems a good proving ground for whether or
not an all-users classifier is feasible.
> > Plagued by conscience, I've just run my 1000 test hams against your
> > SpamHam1.pik classifier.
>
> Using the current code base?
Yeah :( I hesitated. I lost.
> Case in point <wink>: there's one f-p there containing all this stuff:
>
> [ ordained-by-mail keywords removed ]
>
> Now if you've got one user who sucked for a minister-by-mail scam, training
> a classifier to view this as ham is going to let similar scams through to
> all your users.
That's exactly what it was: someone forwarded me a message certifying
them as a minister. My classifier database scores it as ham (p=0.00) on
the following words:
'$33.90': 0.01;
'california,': 0.01;
'california!': 0.01;
"minister's": 0.01;
'divinity': 0.01;
'church,': 0.01;
'funerals,': 0.01;
'ministers),': 0.01;
'ministers).': 0.01;
're-ordained.': 0.01;
'wedding': 0.01;
'hardbound': 0.01;
'deliveries': 0.01;
'ordained': 0.01;
'pass,': 0.01;
'processed,': 0.01
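To see why sixteen clues at 0.01 come out as p=0.00, here is a minimal sketch. It assumes the Graham-style combining rule from "A Plan for Spam"; the code base under discussion may differ in detail, and the name `combine` is made up for illustration.

```python
from math import prod  # Python 3.8+


def combine(probs):
    """Combine per-word spam probabilities, Graham style (an assumption).

    Returns P / (P + Q), where P is the product of the probabilities
    and Q is the product of their complements.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# The sixteen clues listed above, all clamped at 0.01.
clues = [0.01] * 16
score = combine(clues)
print(f"{score:.2f}")  # prints 0.00
```

With every clue at the floor, the spam product is so small that the score is indistinguishable from zero at two decimal places.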
Of course, that's because I *trained* it on this message (among others).
But what's interesting is that two of the big tip-offs from SpamHam1.pik
("ordained": 0.99, "funerals,": 0.99) show up as strong ham indicators
in my database. Doesn't 0.01 mean it's never been seen as spam? I know
it's not a lot of words, but I wonder if this is evidence that the
character of people's spam is just as individual as the character of
their ham. That would point toward spammers doing more targeting than
I thought they were.
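On the 0.01 question, a rough sketch may help. It assumes the per-word estimate works the way Graham's "A Plan for Spam" describes it (ham counts doubled, result clamped to [0.01, 0.99]); I have not checked this against the current code, and `word_prob` and its parameters are hypothetical names. Graham's minimum-occurrence cutoff is left out for brevity.

```python
def word_prob(ham_count, spam_count, nham, nspam, floor=0.01, ceil=0.99):
    """Graham-style spam probability for one word (assumed scheme).

    ham_count / spam_count: times the word was seen in ham / spam.
    nham / nspam: number of ham / spam messages trained on.
    Returns None for a word that was never seen at all.
    """
    g = min(1.0, 2.0 * ham_count / nham)  # ham counts are doubled
    b = min(1.0, spam_count / nspam)
    if g + b == 0:
        return None
    p = b / (g + b)
    # Clamp so that no single word can be absolutely decisive.
    return max(floor, min(ceil, p))


# Never seen in spam: clamped up to the floor.
print(word_prob(3, 0, 1000, 1000))    # prints 0.01
# Seen in spam once, but common in ham: also lands on the floor.
print(word_prob(200, 1, 1000, 1000))  # prints 0.01
# Seen only in spam: clamped down to the ceiling.
print(word_prob(0, 5, 1000, 1000))    # prints 0.99
```

Under these assumptions 0.01 is a floor rather than a guarantee: it covers words never seen in spam as well as words seen there only rarely relative to ham.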