[Spambayes] full o' spaces

Tim Peters tim.one at comcast.net
Sun Mar 9 16:14:55 EST 2003


[Neil Schemenauer, tests the experimental imbalance adjustment]
> Okay, I tested with a natural imbalance.  Looks like it doesn't hurt or
> help me.
>
> out/unbalanced-bases.txt -> out/unbalanced-adjusts.txt
> -> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
> ...

This is a very mild imbalance, so I don't expect much change.  The option
was introduced when people reported imbalance ratios close to 20; yours is
under 1.5.  Since you have more ham than spam, without the adjustment the
spamprob of a ham word can get closer to 0 than the spamprob of a spam word
can get to 1, effectively giving ham words more strength than spam words.
The effect of the adjustment is to make ham words "less hammy", which should
tend to reduce FN and increase FP.  The larger the imbalance ratio, the more
pronounced these effects should be.
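
To make the asymmetry concrete, here is a rough sketch, not the actual
classifier code.  It assumes a Robinson-style spamprob estimate in which the
evidence count n is scaled down for the over-represented class when the
adjustment is on.  The function name, the `adjust` flag, and the defaults
s=0.45 and x=0.5 are illustrative assumptions; only the training counts
(2188 hams, 1556 spams) come from the test run above.

```python
def spamprob(hamcount, spamcount, nham, nspam, adjust=False, s=0.45, x=0.5):
    """Sketch of a per-word spam probability (assumed formula, see lead-in).

    hamcount/spamcount: trained messages of each kind containing the word
    (at least one of them must be nonzero).
    nham/nspam: total trained messages of each kind.
    s, x: assumed strength and prior probability for an unknown word.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)

    if adjust:
        # Shrink the evidence from whichever class has more training data.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    n = hamcount * spam2ham + spamcount * ham2spam
    # Blend the observed ratio with the prior x, weighted by the evidence n.
    return (s * x + n * prob) / (s + n)


NHAM, NSPAM = 2188, 1556  # training counts from the run quoted above

for adjust in (False, True):
    # Most extreme cases: a word in every ham and no spam, and vice versa.
    hammy = spamprob(NHAM, 0, NHAM, NSPAM, adjust)
    spammy = spamprob(0, NSPAM, NHAM, NSPAM, adjust)
    print(f"adjust={adjust}: ham word is {hammy:.3e} from 0, "
          f"spam word is {1.0 - spammy:.3e} from 1")
```

Under these assumptions the unadjusted ham word sits closer to 0 than the
spam word sits to 1, and turning the adjustment on pulls the ham word back
until the two distances match.  With a ratio under 1.5 the shift is small,
which fits the mild changes in the results below.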

> false positive percentages
>     0.731  0.731  tied
>     0.366  0.366  tied
>     0.183  0.548  lost  +199.45%
>     0.183  0.183  tied
>     0.183  0.183  tied
>
> won   0 times
> tied  4 times
> lost  1 times
>
> total unique fp went from 9 to 11 lost   +22.22%
> mean fp % went from 0.329067641682 to 0.402193784278 lost   +22.22%
>
> false negative percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.257  0.257  tied
>
> won   0 times
> tied  5 times
> lost  0 times
>
> total unique fn went from 1 to 1 tied
> mean fn % went from 0.051413881748 to 0.051413881748 tied
>
> ham mean                     ham sdev
>    2.61    2.94  +12.64%       11.66   12.40   +6.35%
>    2.66    2.94  +10.53%       11.20   11.87   +5.98%
>    2.42    2.71  +11.98%       11.25   12.20   +8.44%
>    1.78    2.00  +12.36%        9.05    9.81   +8.40%
>    1.92    2.15  +11.98%        9.00    9.68   +7.56%
>
> ham mean and sdev for all runs
>    2.28    2.55  +11.84%       10.50   11.26   +7.24%
>
> spam mean                    spam sdev
>   99.56   99.63   +0.07%        3.29    2.50  -24.01%
>   99.22   99.30   +0.08%        5.03    4.68   -6.96%
>   99.63   99.68   +0.05%        2.82    2.55   -9.57%
>   99.46   99.55   +0.09%        3.96    3.20  -19.19%
>   99.17   99.22   +0.05%        6.41    6.12   -4.52%
>
> spam mean and sdev for all runs
>   99.41   99.48   +0.07%        4.50    4.06   -9.78%

Since words look "less hammy" after the adjustment, an increase in both means
is expected, and the appearance of ham words in spam doesn't yank down the
spam scores as much, so a decrease in spam sdev is also expected.  OTOH, the
ham words in ham are also less hammy after adjustment, so ham scores are
expected to spread more (-> increase in ham sdev).

So the changes were all qualitatively expected, and overall didn't make a
real bottom-line difference.  Imbalance this mild isn't what the gimmick was
aiming at, though -- it was aimed at stopping disastrous embarrassments for
people with extreme training ratios.

Thank you for trying it!



