[Spambayes] Moving closer to Gary's ideal

Gary Robinson grobinson@transpose.com
Sat, 21 Sep 2002 08:54:25 -0400


>  I'd also appreciate it if you played with max_discriminators
> here, and/or with some other gimmick aiming to keep nearly-neutral words out
> of the scoring; if, e.g., the presence of "the" (a canonical <wink> spamprob
> 0.5 word) moves a score closer to 0.5, that's really not helping things (as
> far as I can see).

Keeping the '0.5's (the nearly-neutral words) is just being REALLY purist.
If everything else is right, it shouldn't hurt.  It just falls out of
trying to be pure, so that we can do this in a mathematical way instead of
constantly finding little patches for little problems.  In my experience,
being "relentlessly" mathematical, as opposed to continually finding
little "kludgy" solutions to each little aspect, generally pays off...
that path forces you to really get the basics EXACTLY right, and then
various theorems can realistically be invoked to maximize performance.
OTOH, sometimes the data is just so messy to begin with that it isn't
possible to get there.

IN ANY CASE, I think it would be pragmatic at this point to run more
experiments on small corpora with FI1 (f(w)) plus some kind of cutoff so
that only extreme values are processed... experimentally, you have accrued
a lot of evidence that this does help under the current overall
processing.

f(w) doesn't rely on esoteric theorems to be useful, as g(w) does. f(w) just
MAKES SENSE. It should work even if everything else isn't totally pure.
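
To be concrete about what I mean by f(w), here's a minimal sketch,
assuming FI1 is the usual weighted average toward the unknown-word
probability x (Tim's run below uses a=1 and x=0.5); how n should be
counted is exactly the kind of consistency question I get into further
down.

    # Minimal sketch of the FI1 adjustment, assuming it's the weighted
    # average toward the unknown-word probability x:
    #
    #     f(w) = (a*x + n*p(w)) / (a + n)
    #
    # p is the raw per-word spamprob, n is the number of training
    # messages the word was seen in (or however the counts end up being
    # tallied), and a, x are as in Tim's run below (a=1, x=0.5).
    def f(p, n, a=1.0, x=0.5):
        return (a * x + n * p) / (a + n)

With no data (n=0) it degenerates to x; with lots of data it converges to
p(w).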

I would run MAX_DISCRIMINATOR testing with respect to f(w), not p(w),
because f(w) is more meaningful.  Tim has tried both the 150 most extreme
and the 15 most extreme... it would be interesting to see what happens
with both of those settings using abs(f(w) - 0.5) as the indicator of
extremeness -- maybe that's what Tim is currently doing, I'm not sure.
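
For instance, picking the discriminators by distance of f(w) from 0.5
could look something like this (illustrative code only, not the actual
classifier):

    # Illustrative only -- not the spambayes classifier code.  Given the
    # f(w) values for the distinct words in a message, keep only the
    # max_discriminators values farthest from 0.5.
    def extreme_words(fw_by_word, max_discriminators=15):
        ranked = sorted(fw_by_word.items(),
                        key=lambda item: abs(item[1] - 0.5),
                        reverse=True)
        return ranked[:max_discriminators]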

==>I would also experiment with the size of a.  When a gets small
(compared to 1), a very different set of words will end up at the
extremes.  a is really something that could meaningfully be tuned for
performance.<==
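
Toy numbers, using the same formula as the f(w) sketch above, show why:

    # Standalone demo: how a shifts f(w) for a sparse word
    # (n=1, p(w)=1.0, x=0.5), using f(w) = (a*x + n*p) / (a + n).
    for a in (1.0, 0.5, 0.1):
        n, p, x = 1, 1.0, 0.5
        print(a, (a * x + n * p) / (a + n))
    # a=1.0 -> 0.75, a=0.5 -> 0.83, a=0.1 -> 0.95 (roughly).
    # Small a trusts sparse evidence more, so a different set of words
    # ends up at the extremes; frequent words are barely affected.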

---

For those of you who may be interested in my "purist" perspective (others
can stop reading now! ;) ):

I want to be able to invoke a particular theorem via g(w). I have done so
before very effectively, but it may not be possible here.

I can give ideas to test to bring things further in that direction, but it
all depends on how far you guys want to go.

I would love to offload the experimental work on that idea from you guys
and do it myself, but I can't right now.  I'm CEO of a company that is in
a critical period and I just have no free time.  I may have a bit more
free time beginning mid-October, but I'm not sure.

From the purist perspective, I think it would be good to look at the case
of these conference announcements that are now being classified as spam,
and see if we can find some very fundamental reason why they now come out
as spam.  I have some ideas which I will think about...

For instance, it is not really consistent that the original p's are
computed by counting ALL the occurrences of the words, even multiple
occurrences in the same email, while the classification step then looks
only at presence vs. absence of a word.  Until we are very consistent
about all of that, we may not have an underlying distribution that lets us
invoke theorems that rely on particular distributions, because everything
may just be too off-kilter.
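
Here's a toy illustration of the mismatch I mean (illustrative code, not
the actual tokenizer or classifier):

    # Illustrative only.  The same training spam gives two different
    # counts for "cash", depending on whether repeats within a single
    # message are counted.
    spams = [["cash", "cash", "cash", "now"],
             ["meeting", "now"]]

    occurrence_count = sum(m.count("cash") for m in spams)  # 3
    message_count = sum("cash" in m for m in spams)          # 1

    # If p(w) is estimated from occurrence_count, but scoring only asks
    # "is 'cash' present in this message?", then the training statistic
    # and the classification statistic aren't measuring the same thing.
    print(occurrence_count, message_count)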

Also, there may be other problems that have to be addressed in order to
really have a known distribution under a null hypothesis.  I thought of
some possible ones last night and will continue thinking about them.  Some
of these problems may be intrinsic to the data we're dealing with -- there
may be no way around them, or we may have to do something clever to work
around them.

When I get clearer on the problems and possible solutions, I'll say so... or
if there's anyone here who wants to work with me on that perspective, that
would be great... just let me know.

But in the meantime, I think you guys should play with f(w) and tune a and
MAX_DISCRIMINATOR and see if you can optimise things for better performance
than you had before, and stay away from further work on g(w). That's
pragmatic.
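
If it helps, this is the rough shape of the tuning loop I have in mind;
run_cv() is a hypothetical stand-in for whatever cross-validation harness
you're already running, assumed to return mean false-positive and
false-negative rates for one setting:

    # Rough sketch only.  run_cv() is a hypothetical stand-in for an
    # existing cross-validation harness; assume it returns
    # (mean_fp_rate, mean_fn_rate) for one parameter setting.
    def tune(run_cv):
        best = None
        for a in (0.01, 0.1, 0.5, 1.0):
            for max_discriminators in (15, 50, 150, 1500):
                fp, fn = run_cv(a=a, max_discriminators=max_discriminators)
                cost = fp + fn  # or weight fp more heavily
                if best is None or cost < best[0]:
                    best = (cost, a, max_discriminators, fp, fn)
        return best

And keep looking at the score histograms for the winning settings, not
just the summary error rates.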

--Gary

-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Sat, 21 Sep 2002 07:16:37 -0400
> To: Gary Robinson <grobinson@transpose.com>, SpamBayes <spambayes@python.org>
> Cc: glouis@dynamicro.on.ca
> Subject: RE: [Spambayes] Moving closer to Gary's ideal
> 
> Update.  My test data looks clean again.  Added 251 new spam, and purged a
> ham that's been hiding in the spam forever.  Now using 20,000 ham and 14,000
> spam.
> 
> There were two cut'n'paste typos in the code that affected results.  Fixing
> those increased the separation between the ham and spam means over what I
> reported last time.
> 
> The options used here are as reported last time:
> 
> """
> [Classifier]
> use_robinson_probability: True
> use_robinson_combining: True
> max_discriminators: 1500
> 
> [TestDriver]
> spam_cutoff: 0.50
> """
> 
> This implements everything we talked about, except for the disputed ranking
> step.  All biases are gone, and no limits are placed on the probabilities.
> For the probability adjustment step (FI1), I'm using a=1 and x=0.5 (I can
> play with this, but doubt they're the most useful things to poke at; and we
> moved to using 0.5 for the "unknown word probability" under Graham's scheme
> long ago).
> 
> Here's a before-and-after 10-fold cross validation run, where before is the
> default (our highly tweaked Graham scheme).  Each run trained on 18000 hams
> & 12600 spams, then predicted 2000 disjoint hams & 1400 disjoint spams.  As
> will be clear soon, the results aren't as bizarre as they look:
> 
> false positive percentages
>   0.000  0.250  lost  +(was 0)
>   0.000  0.200  lost  +(was 0)
>   0.000  0.250  lost  +(was 0)
>   0.000  0.100  lost  +(was 0)
>   0.050  0.450  lost  +800.00%
>   0.000  0.250  lost  +(was 0)
>   0.000  0.150  lost  +(was 0)
>   0.050  0.300  lost  +500.00%
>   0.000  0.200  lost  +(was 0)
>   0.100  0.350  lost  +250.00%
> 
> won   0 times
> tied  0 times
> lost 10 times
> 
> total unique fp went from 4 to 50 lost  +1150.00%
> mean fp % went from 0.02 to 0.25 lost  +1150.00%
> 
> false negative percentages
>   0.214  0.000  won   -100.00%
>   0.286  0.000  won   -100.00%
>   0.000  0.000  tied
>   0.143  0.000  won   -100.00%
>   0.143  0.000  won   -100.00%
>   0.286  0.000  won   -100.00%
>   0.143  0.071  won    -50.35%
>   0.143  0.000  won   -100.00%
>   0.286  0.000  won   -100.00%
>   0.071  0.000  won   -100.00%
> 
> won   9 times
> tied  1 times
> lost  0 times
> 
> total unique fn went from 24 to 1 won    -95.83%
> mean fn % went from 0.171428571428 to 0.00714285714286 won    -95.83%
> 
> So this test was a disaster for the false positive rate and a huge win for
> the false negative rate.  This is because 0.50 is too low a cutoff now:
> 
> Ham distribution for all runs:
> * = 86 items
> 0.00    0
> 2.50    0
> 5.00    0
> 7.50    0
> 10.00    0
> 12.50    0
> 15.00    0
> 17.50    0
> 20.00   26 *
> 22.50  155 **
> 25.00  627 ********
> 27.50 1859 **********************
> 30.00 3780 ********************************************
> 32.50 5108 ************************************************************
> 35.00 4264 **************************************************
> 37.50 2450 *****************************
> 40.00 1056 *************
> 42.50  395 *****
> 45.00  178 ***
> 47.50   52 *
> 50.00   30 *
> 52.50   13 *
> 55.00    4 *
> 57.50    1 *
> 60.00    1 *
> 62.50    1 *
> 65.00    0
> 67.50    0
> 70.00    0
> 72.50    0
> 75.00    0
> 77.50    0
> 80.00    0
> 82.50    0
> 85.00    0
> 87.50    0
> 90.00    0
> 92.50    0
> 95.00    0
> 97.50    0
> 
> Spam distribution for all runs:
> * = 50 items
> 0.00    0
> 2.50    0
> 5.00    0
> 7.50    0
> 10.00    0
> 12.50    0
> 15.00    0
> 17.50    0
> 20.00    0
> 22.50    0
> 25.00    0
> 27.50    0
> 30.00    0
> 32.50    0
> 35.00    0
> 37.50    0
> 40.00    0
> 42.50    1 *
> 45.00    0
> 47.50    0
> 50.00    3 *
> 52.50    6 *
> 55.00   17 *
> 57.50   40 *
> 60.00   76 **
> 62.50  171 ****
> 65.00  394 ********
> 67.50  710 ***************
> 70.00 1247 *************************
> 72.50 2358 ************************************************
> 75.00 2986 ************************************************************
> 77.50 2659 ******************************************************
> 80.00 1957 ****************************************
> 82.50 1069 **********************
> 85.00  192 ****
> 87.50   31 *
> 90.00   61 **
> 92.50   22 *
> 95.00    0
> 97.50    0
> 
> So if the cutoff were boosted to 0.575, we'd lose 30+13+4 = 47 fp, and gain
> 3+6+17 = 26 fn, for a grand total of 3 fp and 27 fn.  That would leave it
> essentially indistinguishable from the "before" run, but gets there without
> artificial biases and limits.  Good show!  Against it, I have no idea how to
> *predict* where to put the cutoff, and that simply wasn't an issue before
> (0.90 "just worked", and on this particular large test only 5 of the 20,000
> ham scored above 0.10, while only 24 of the 14,000 spam scored below 0.90).
> 
> The highest-scoring ham was again the fellow who added a one-line comment to
> a quote of an entire Nigerian scam msg.  The second-highest was again the
> lady looking for a Python course in the UK, damned by her employer's much
> longer obnoxious sig.  These are all familiar.  Something we haven't seen
> for a few weeks is the systematic reappearance of conference announcements
> among the high-scoring ham; under the current scheme, tokenization
> preserving case also hates those, and so does tokenization via word bigrams,
> and ditto via character 5-grams; the tokenization we're using now
> (case-folded unigrams but preserving punctuation) didn't hate them under the
> Graham scheme (indeed, that's the only tokenization scheme I've tried that
> *doesn't* hate them; they appear to benefit from Graham's low
> max_discriminators because the really spammish "visit our website for more
> information!" stuff doesn't show up until near the end, and by then enough
> 0.01 clues have been seen that the end doesn't matter).
> 
> The lowest-scoring spam is embarrassing:  it has
> 
>   Subject: HOW TO BECOME A MILLIONAIRE IN WEEKS!!
> 
> and the body consists solely of a uuencoded text file, which we don't
> decipher.  That's the only spam to score below 0.5.  The tokenizer doesn't
> give the inferencer much to go on here.  You'd think, e.g., that MILLIONAIRE
> in the subject line is a spam clue; and it is, but it's *so* blatant that
> few spam actually do that, leaving
> 
>   prob('subject:MILLIONAIRE') = 0.666667
> 
> Two other words in the subject were actually stronger clues:
> 
>   prob('subject:WEEKS') = 0.833333
>   prob('subject:HOW') = 0.867101
> 
> 
> Everyone who tests this (please do!  it looks very promising, although my
> data only supports that it's not a regression -- I *expect* it will do
> better for some of you), pay attention to your score histograms and figure
> out the best value for spam_cutoff from them.  That would be a good number
> to report.  I'd also appreciate it if you played with max_discriminators
> here, and/or with some other gimmick aiming to keep nearly-neutral words out
> of the scoring; if, e.g., the presence of "the" (a canonical <wink> spamprob
> 0.5 word) moves a score closer to 0.5, that's really not helping things (as
> far as I can see).  Note that if you fiddle with both, they're most likely
> not independent, so be sure to keep looking at the histograms (they reveal a
> hell of a lot more than the raw error rates do).
>
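
P.S. For anyone following Tim's suggestion to pick spam_cutoff from the
histograms, here is a minimal sketch of doing it mechanically from the
per-message scores (a hypothetical helper, not part of the test driver):

    # Minimal sketch, not part of the test driver.  Given per-message
    # scores from a run, pick the cutoff that minimizes total errors;
    # weight false positives more heavily if you care more about them.
    def pick_cutoff(ham_scores, spam_scores, fp_weight=1.0):
        best = None
        for c in sorted(set(ham_scores) | set(spam_scores)):
            fp = sum(1 for s in ham_scores if s >= c)
            fn = sum(1 for s in spam_scores if s < c)
            cost = fp_weight * fp + fn
            if best is None or cost < best[0]:
                best = (cost, c, fp, fn)
        return best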