[Spambayes] spamprob combining

Gary Robinson grobinson@transpose.com
Sat, 12 Oct 2002 11:39:24 -0400


This sounds like it's working out pretty well!

If we get to the point that it becomes the accepted technique for spambayes,
I'll add it to my online essay.

NOTE: 

As we've discussed ad nauseam, this multiplicative thing is one-sided in its
sensitivity, which is why we end up having to do something like S/(S+H),
where S comes from combining the (1-p)'s and H comes from combining the
p's.
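
For concreteness, here's a minimal sketch of that combining in Python --
illustrative only, close in spirit to the chi-squared combining but not
the exact spambayes implementation:

    from math import exp, log

    def chi2Q(x2, v):
        # P(chisq with v degrees of freedom >= x2); v must be even.
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def combined_spamprob(probs):
        # probs: the individual word spamprobs, each in (0.0, 1.0).
        n = len(probs)
        # S measures spamminess via the (1-p)'s, H measures hamminess
        # via the p's -- each is one-sided on its own, hence S/(S+H).
        S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
        H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
        return S / (S + H)  # real code would guard against S + H == 0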

There ARE meta-analytical ways of combining the p-values which are equally
sensitive on both sides... but they're a TAD less sensitive overall than the
chi-square thing. And frankly, the S/(S+H)-style trick may take away a lot
of that super-strength sensitivity anyway -- maybe even all of the advantage
over other methods (I just don't know without directly testing it).

So a two-sided combining approach may perform equally well for our practical
purposes... there's no way of knowing without trying.

The advantage of such an approach would essentially be algorithmic elegance.
No longer would we need that kludgy (P-Q)/(P+Q) or S/(S+H) stuff, which
doesn't convert to a real probability.

Instead, the combined P would be all we would need. Combined P near 1 would
be spammy, and combined P near 0 would be hammy. And P would be a REAL
probability (against the null hypothesis of randomness).
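
Just to illustrate the flavor -- this is one well-known two-sided method,
not necessarily the one to use -- Stouffer's z-score method maps each p to
a standard-normal z, averages, and maps back, so the result is itself a
probability and evidence near 0 and near 1 weigh equally:

    from statistics import NormalDist

    _norm = NormalDist()  # the standard normal distribution

    def stouffer_combined_p(probs):
        # Two-sided by construction: z-scores from p's near 0 and
        # near 1 simply cancel or reinforce.  Real code would clamp
        # the p's away from exactly 0.0 and 1.0 first.
        n = len(probs)
        z = sum(_norm.inv_cdf(p) for p in probs) / n ** 0.5
        return _norm.cdf(z)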

I wouldn't expect any performance ADVANTAGE to this other approach, but it
WOULD be more elegant. (Note that all these approaches depend on one or
another statistical function, just as the current one depends on the
inverse chi-square.)

If you are interested in going that way, let me know and I'll send info on
how to do it. Maybe you'll have another beautifully simple algorithm up your
sleeve to implement the necessary statistical function.


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Sat, 12 Oct 2002 02:27:29 -0400
> To: SpamBayes <spambayes@python.org>
> Cc: Gary Robinson <grobinson@transpose.com>
> Subject: RE: [Spambayes] spamprob combining
> 
> OK!  Gary and I exchanged info offline, and I believe the implementation of
> use_chi_squared_combining matches his intent for it.
> 
>> ...
>> Example:  if we called everything from 50 thru 80 "the middle
>> ground", ... in a manual-review system, this combines all the
>> desirable properties:
>> 
>> 1. Very little is kicked out for review.
>> 
>> 2. There are high error rates among the msgs kicked out for review.
>> 
>> 3. There are unmeasurably low error rates among the msgs not kicked
>>    out for review.
> 
> On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this
> got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the
> all-default scheme with the very touchy spam_cutoff.  The middle ground is
> the *interesting* thing, and it's like a laser beam here (yippee!).  In the
> "50 thru 80" range guessed at above,
> 
> 1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737).
>  The other 2 FP scored 0.999999929221 (Nigerian scam quote) and
>  0.972986477986 (lady with the short question and long obnoxious
>  employer-generated SIG).  I don't believe any usable scheme will
>  ever call those ham, though, or put them in a middle ground without
>  greatly bloating the middle ground with correctly classified
>  messages.
> 
> 2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN
>  (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near
>  0.61, 1 near 0.63, and 1 near 0.68).  The 3 remaining spam scored
>  below 0.50:
> 
> 0.35983017036
>   "Hello, my Name is BlackIntrepid"
>   Except that it contained a URL and an invitation to visit it, this
>   could have been a poorly written c.l.py post explaining a bit
>   about hackers to newbies (and if you don't think there are
>   plenty of those in my ham, you don't read c.l.py <wink>).
> 
> 0.39570232415
>   The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam,
>   whose body consists of a uuencoded text file we throw away
>   unlooked at.  (This is quite curable, but I doubt it's worth
>   the bother -- at least until spammers take to putting everything
>   in uuencoded text files!)
> 
> 0.499567195859 (about as close to "middle ground" cutoff as can be)
>   A giant (> 20KB) base64-encoded plain text file.  I've never
>   bothered to decode this to see what it says; like the others,
>  though, it's been a persistent FN under all schemes.  Note that the
>  software does decode this; I've always assumed it's of the "long, chatty,
>   just-folks" flavor of tech spam that's hard to catch; the list of
>   clues contains "cookies", "editor", "ms-dos", "backslashes",
>   "guis", "commands", "folder", "dumb", "(well,", "cursor",
>   and "trick" (a spamprob 0.00183748 word!).
> 
> 
> For my original purpose of looking at a scheme for c.l.py traffic, this has
> become the clear leader among all schemes:  while it's more extreme than I
> might like, it made very few errors, and a minuscule middle ground (less
> than 0.08% of all msgs) contains 64+% of all errors.  3 FN would survive,
> and 2 FP, but I don't expect that any usable scheme could do better on this
> data.  Note that Graham combining was also very extreme, but had *no* usable
> middle ground on this data:  all mistakes had scores of almost exactly 0.0
> or almost exactly 1.0 (and there were more mistakes).
> 
> How does it do for you?  An analysis like the above is what I'm looking for,
> although it surely doesn't need to be so detailed.  Here's the .ini file I
> used:
> 
> """
> [Classifier]
> use_chi_squared_combining: True
> 
> [TestDriver]
> spam_cutoff: 0.70
> 
> nbuckets: 200
> best_cutoff_fp_weight: 10
> 
> show_false_positives: True
> show_false_negatives: True
> show_best_discriminators: 50
> show_spam_lo: 0.40
> show_spam_hi: 0.80
> show_ham_lo: 0.40
> show_ham_hi: 0.80
> show_charlimit: 100000
> """
> 
> Your best spam_cutoff may be different, but the point to this exercise isn't
> to find the best cutoff, it's to think about the middle ground.  Note that I
> set
> 
>  show_{ham,spam}_{lo,hi}
> 
> to values such that I would see every ham and spam that lived in my presumed
> middle ground of 0.50-0.80, plus down to 0.40 on the low end.   I also set
> show_charlimit to a large value so that I'd see the full text of each such
> msg.
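> 
> In other words, the middle ground turns the scheme into a three-way
> triage on the combined score.  Schematically (illustrative code only,
> not part of the test driver; the band edges are just the "50 thru 80"
> guess from above):
> 
>     def triage(score, review_lo=0.50, review_hi=0.80):
>         if score > review_hi:
>             return "spam"
>         if score >= review_lo:
>             return "review"   # kicked out for manual review
>         return "ham"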
> 
> Heh:  My favorite:  Data/Ham/Set7/51781.txt got overall score 0.485+, close
> to the middle ground cutoff.  It's a msg I posted 2 years ago to the day (12
> Oct 2000), and consists almost entirely of a rather long transcript of part
> of the infamous Chicago Seven trial:
> 
>   http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html
> 
> I learned two things from this <wink>:
> 
> 1. There are so many unique lexical clues when I post a thing, I can
>  get away with posting anything.
> 
> 2. "tyranny" is a spam clue, but "nazi" a ham clue:
> 
>     prob('tyranny') = 0.850877
>     prob('nazi')    = 0.282714
> 
> leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs  - tim
>