[Spambayes] Results of playing with CDB

16 Sep 2002 11:36:56 -0700

So then, Tim Peters <tim.one@comcast.net> is all like:

> > On the plus side, their inboxes are likely to be very jargon-free.
> 
> The lack of jargon is likely to hurt more than help -- the classifier
> gets as much good out of finding "good words" as "bad" ones, but the
> set of good words likely varies across users.

A jargon-free mailbox or two seems a good proving ground for whether or
not an all-users classifier is feasible.

> > Plagued by conscience, I've just run my 1000 test hams against your
> > SpamHam1.pik classifier.
> 
> Using the current code base?  

Yeah :(  I hesitated.  I lost.

> Case in point <wink>:  there's one f-p there containing all this stuff:
> 
> [ ordained-by-mail keywords removed ]
> 
> Now if you've got one user who sucked for a minister-by-mail scam, training
> a classifier to view this as ham is going to let similar scams through to
> all your users.

That's exactly what it was: someone forwarded me a message certifying
them as a minister.  My classifier database scores it as ham (p=0.00) on
the following words:

        '$33.90': 0.01;
        'california,': 0.01;
        'california!': 0.01; 
        "minister's": 0.01; 
        'divinity': 0.01;
        'church,': 0.01;
        'funerals,': 0.01;
        'ministers),': 0.01;
        'ministers).': 0.01;
        're-ordained.': 0.01;
        'wedding': 0.01;
        'hardbound': 0.01;
        'deliveries': 0.01;
        'ordained': 0.01; 
        'pass,': 0.01;
        'processed,': 0.01

Of course, that's because I *trained* it on this message (among others).
But what's interesting is that two of the big tip-offs from SpamHam1.pik
("ordained": 0.99, "funerals,": 0.99) show up as strong ham indicators
in my database.  Doesn't 0.01 mean it's never been seen as spam?  I know
it's not a lot of words, but I wonder if this is evidence that the
character of people's spam is just as individual as the character of
their ham.  That would point toward spammers doing more targeting than
I thought they were.
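
To make the 0.01 question concrete, here is a rough sketch of Graham-style
word scoring as described in "A Plan for Spam".  I'm assuming the 0.01/0.99
clamps and the doubling of ham counts from that essay; the function names,
the unknown_prob default, and the example counts are made up for
illustration, and the current code base may well differ in the details.

```python
def word_spamprob(ham_count, spam_count, nham, nspam,
                  min_prob=0.01, max_prob=0.99,
                  ham_bias=2.0, unknown_prob=0.5):
    """Graham-style spam probability for a single word.

    ham_count / spam_count: number of ham / spam messages containing the word.
    nham / nspam: total number of ham / spam messages trained on.
    ham_bias and unknown_prob are illustrative assumptions.
    """
    ham_ratio = min(1.0, ham_bias * ham_count / nham)
    spam_ratio = min(1.0, spam_count / nspam)
    if ham_ratio + spam_ratio == 0:
        # Word never seen in training at all.
        return unknown_prob
    prob = spam_ratio / (ham_ratio + spam_ratio)
    # Clamp so no single word is ever treated as absolute proof.
    return max(min_prob, min(max_prob, prob))


def combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Seen in one ham and no spam: raw probability 0.0, clamped up to 0.01.
only_ham = word_spamprob(1, 0, 1000, 1000)

# Seen in 200 hams and one spam: raw probability is about 0.0025,
# which also gets clamped up to 0.01.
mostly_ham = word_spamprob(200, 1, 1000, 1000)

# Sixteen clues at 0.01, like the list above, combine to p=0.00.
score = combine([0.01] * 16)
print("%.2f %.2f %.2f" % (only_ham, mostly_ham, score))
```

Under those assumptions, 0.01 is what you get for a word seen only in ham,
but a word that turned up in spam once and in ham many times lands on the
same floor, so 0.01 by itself doesn't quite prove "never seen as spam".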