[Spambayes] defaults vs. chi-square

Tim Peters tim.one@comcast.net
Mon, 14 Oct 2002 15:35:33 -0400


[T. Alexander Popiel]
> I'm being lazy today, so I haven't put this one up on my
> website in all its gory detail.

I confess I haven't been able to make enough time to follow all the msgs on
this list carefully, let alone cruise the web mining more details.  If
stupid beats smart here, let's hope lazy beats ambitious too <wink>.

> I did a cvs up, catching the changes to the histograms
> and the cost determinations.

Good!

> I did not catch Tim's last modification for tagging the cost
> computations with set/all discriminators.

That's fine -- purely cosmetic, no difference in results.

> cv1 is all defaults.  cv2 is chi-square, but otherwise default.
>
> """
> cv1s -> cv2s
> -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
> [yadda yadda yadda]
> -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
>
> false positive percentages
>     0.500  0.500  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.500  lost  +(was 0)
>     1.000  1.000  tied
>     0.000  0.500  lost  +(was 0)
>     0.000  0.000  tied
>     0.500  1.000  lost  +100.00%
>     0.000  0.000  tied
>
> won   0 times
> tied  7 times
> lost  3 times
>
> total unique fp went from 4 to 7 lost   +75.00%
> mean fp % went from 0.2 to 0.35 lost   +75.00%
>
> false negative percentages
>     2.000  1.500  won    -25.00%
>     1.500  0.500  won    -66.67%
>     4.000  2.000  won    -50.00%
>     2.000  1.000  won    -50.00%
>     2.000  1.500  won    -25.00%
>     3.000  2.000  won    -33.33%
>     5.000  3.500  won    -30.00%
>     3.000  1.500  won    -50.00%
>     5.000  2.500  won    -50.00%
>     2.000  0.500  won    -75.00%
>
> won  10 times
> tied  0 times
> lost  0 times
>
> total unique fn went from 59 to 33 won    -44.07%
> mean fn % went from 2.95 to 1.65 won    -44.07%
>
> ham mean                     ham sdev
>   17.22    0.50  -97.10%        7.39    7.04   -4.74%
>   18.69    0.27  -98.56%        7.27    3.71  -48.97%
>   18.86    0.04  -99.79%        6.50    0.41  -93.69%
>   16.79    0.41  -97.56%        7.75    4.13  -46.71%
>   18.66    0.36  -98.07%        7.09    4.84  -31.73%
>   18.47    1.01  -94.53%        7.83    9.42  +20.31%
>   18.19    0.51  -97.20%        6.99    5.47  -21.75%
>   18.38    0.16  -99.13%        6.80    1.94  -71.47%
>   17.67    0.95  -94.62%        7.88    9.40  +19.29%
>   17.72    0.14  -99.21%        6.18    1.88  -69.58%
>
> ham mean and sdev for all runs
>   18.07    0.44  -97.57%        7.22    5.65  -21.75%
>
> spam mean                    spam sdev
>   75.58   98.42  +30.22%        9.15   10.85  +18.58%
>   76.81   99.26  +29.23%        8.53    5.56  -34.82%
>   74.95   97.82  +30.51%        9.44   12.18  +29.03%
>   76.18   98.85  +29.76%        8.64    8.90   +3.01%
>   76.55   98.55  +28.74%        8.84    9.65   +9.16%
>   76.08   98.31  +29.22%        8.69   11.21  +29.00%
>   75.61   97.25  +28.62%        9.72   13.12  +34.98%
>   76.51   98.98  +29.37%        8.30    6.15  -25.90%
>   75.92   98.26  +29.43%        9.62   10.37   +7.80%
>   75.52   99.01  +31.10%        8.76    5.46  -37.67%
>
> spam mean and sdev for all runs
>   75.97   98.47  +29.62%        9.00    9.72   +8.00%
>
> ham/spam mean difference: 57.90 98.03 +40.13
> """
>
> Nothing too surprising, though I wonder if it would be good
> to mangle cmp.py to output a table for unsure like it does
> for fp and fn.  It also looks like it's using the raw untuned
> numbers for fp and fn, instead of the computed best values.

Yes, cmp.py doesn't look at the histograms at all; it's mining the
individual

> ...
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.500  lost  +(was 0)
>     1.000  1.000  tied
>     0.000  0.500  lost  +(was 0)
>     0.000  0.000  tied
> ...

output lines.  Those are still based on a single value for spam_cutoff, and
a single cutoff value doesn't really make sense for the "middle ground"
schemes.  The mean and sdev stats remain interesting for them, but cmp.py's
fn and fp counts are at best misleading there.  For now, the histogram
analysis is the best analytic output we get for such schemes.
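The three-way decision a middle-ground scheme makes can be sketched like so
(the cutoff values here are just illustrative defaults, not anything
blessed):

```python
def classify(score, ham_cutoff=0.05, spam_cutoff=0.90):
    # Three-way decision for a "middle ground" scheme:  anything between
    # the two cutoffs is deliberately left unsure, rather than being
    # forced to ham or spam by a single spam_cutoff threshold.
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

A tool that only knows a single spam_cutoff has to call every "unsure" msg
either an fp or an fn, which is exactly why cmp.py's counts mislead here.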

> The best info for cv1 (defaults):
>
> """
> -> best cost $41.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.425 & 0.635
> ->     fp 0; fn 6; unsure ham 14; unsure spam 162
> ->     fp rate 0%; fn rate 0.3%; unsure rate 4.4%
> """

The all-default scheme does do very well; the practical difficulty has been
that "the best" cutoff values seem extremely corpus-dependent, and even so
require 3 digits of precision to express, and change depending on how much
data you train on.  Cutoffs that can only be determined after the fact, and
only when knowing exactly what the classifications *should* have been, are
impractical on several counts.  Still, if you had a time machine (so could
pick "the best" cutoffs later and apply them retroactively), nothing else
really does better.
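The after-the-fact search for "the best" cutoffs amounts to sweeping all
(ham_cutoff, spam_cutoff) pairs against the known-correct labels and keeping
the cheapest.  A brute-force sketch (the real test driver works from
histogram buckets; the function name and the 0.01 grid here are made up):

```python
def best_cost(ham_scores, spam_scores,
              fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    # Sweep all (ham_cutoff, spam_cutoff) pairs on a 0.01 grid.
    # Scores below ham_cutoff count as ham, above spam_cutoff as spam,
    # and anything in between as "unsure".
    cutoffs = [i / 100.0 for i in range(101)]
    best = None
    for lo in cutoffs:
        for hi in cutoffs:
            if hi < lo:
                continue
            fp = sum(1 for s in ham_scores if s > hi)
            fn = sum(1 for s in spam_scores if s < lo)
            unsure = (sum(1 for s in ham_scores if lo <= s <= hi) +
                      sum(1 for s in spam_scores if lo <= s <= hi))
            cost = fp * fp_cost + fn * fn_cost + unsure * unsure_cost
            if best is None or cost < best[0]:
                best = (cost, lo, hi)
    return best  # (cost, ham_cutoff, spam_cutoff)
```

Of course this needs the true labels in hand, which is the time-machine
problem:  you can't run it before classifying the mail.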

> The best info for cv2 (chi-square):
>
> """
> -> best cost $48.00
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 3 cutoff pairs
> -> smallest ham & spam cutoffs 0.03 & 0.89
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> -> largest ham & spam cutoffs 0.03 & 0.9
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> """

And this seems a lot easier to live with in a world without time machines:
the middle ground spans a huge range of scores, yet contains a lot fewer
msgs than under highly-corpus-tuned cv1.

> The histograms for chi-square look pretty much like all the other
> histograms reported here (big spikes at the ends for the ham and
> spam, several spread lightly (and fairly evenly) over the middle
> ground.
>
> I must say that I like chi-square best out of all the ones I've
> tested, since it has fairly obvious points for the cutoffs (I suspect
> that .05 and .90 are not too far from optimal for just about everyone),
> and it does have a useful middle ground.

I agree on all counts.
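For anyone following along, chi combining is essentially two Fisher
chi-squared tests glued together:  treat the word probabilities as p-values,
test them against the "it's ham" and "it's spam" hypotheses separately, and
fold the two tail probabilities into one score.  A sketch (chi2q is the
chi-squared survival function for an even number of degrees of freedom;
probabilities are assumed to lie strictly between 0 and 1):

```python
import math

def chi2q(x2, v):
    # P(X >= x2) for a chi-squared variable with v degrees of freedom,
    # v even:  exp(-x2/2) * sum_{i=0}^{v/2 - 1} (x2/2)^i / i!
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # -2 * sum(ln p) is chi-squared with 2n degrees of freedom if the
    # p's are uniform; a tiny tail probability means the hypothesis is
    # rejected.  S rejects "ham", H rejects "spam"; map both onto [0, 1].
    n = len(probs)
    H = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    S = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S + (1.0 - H)) / 2.0
```

When the evidence is strong both ways (a long chatty spam), S and H are both
near 1, and the score lands near 0.5:  the middle ground.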

> (The false positives I get from it are fairly hopeless cases:
> FDIC informing customers that NextBank died, a contractor's bid
> containing only an encoded .pdf,

That one surprises me:  assuming we threw the body away unlooked-at (we
ignore MIME sections that aren't of text/* type), it's hard to get enough
other clues to force a spam score so high.  If possible, I'd like to see the
list of clues (the "prob('word') = 0.432" thingies in the main output file,
assuming you have show_false_positives enabled).

> info requests wrt getting a new mortgage.  The false negatives are a
> bunch of particularly chatty spams, and one or two with empty bodies.
> Again, fairly hopeless.)

Long chatty spam has been pretty reliably scoring near 0.5 for me, which has
been a real advantage of chi combining.  So again I'd really like to see the
list of clues.

> I'll be testing the zcombining shortly.

I look forward to it.  Note that, as above, that's another middle-ground
scheme, so only the histogram analysis will be truly interesting.