Third result ... RE: [Spambayes] First result from Gary Robinson's ideas

Tim Peters tim.one@comcast.net
Fri, 20 Sep 2002 00:23:42 -0400


[Anthony Baxter]
> Let's try that again, this time with the current tokenizer.py

Not a bad idea <wink> -- thanks.

> same settings as before.
>
> false positive percentages
>     0.223  0.334  lost   +49.78%
>     0.278  0.278  tied
>     0.167  0.167  tied
>     0.278  0.278  tied
>     0.334  0.389  lost   +16.47%
>
> won   0 times
> tied  3 times
> lost  2 times
>
> total unique fp went from 23 to 26 lost   +13.04%
> mean fp % went from 0.255790439419 to 0.289173231033 lost   +13.05%
>
> false negative percentages
>     0.582  0.517  won    -11.17%
>     0.388  0.388  tied
>     0.581  0.516  won    -11.19%
>     0.518  0.453  won    -12.55%
>     0.712  0.712  tied
>
> won   3 times
> tied  2 times
> lost  0 times
>
> total unique fn went from 43 to 40 won     -6.98%
> mean fn % went from 0.556199665243 to 0.517406493303 won     -6.97%

There isn't a compelling case for saying this is significantly different.
But, as before, it seems more eager to call things spam, and again shifting
spam_cutoff from 0.50 to 0.525 would tell a seemingly different story: fp
would *fall* from 23 to 19, while fn would *rise* from 43 to 44.  (BTW, I
did those counts hastily <wink> -- double-checking my claims against your
"after" histograms would be a good exercise.)  It could be that, for an
especially challenging corpus like this, raising spam_cutoff is a good idea
for good reasons.  Given the still-partial implementation of Gary's
suggestions, though, in the time I can make for this I'd rather push the
rest of Gary's ideas along than try to tune the part we've already done.
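The spam_cutoff trade-off is just thresholding the per-message scores: raising the cutoff reclassifies borderline messages as ham, so fp drops and fn climbs.  A toy sketch (the scores below are invented for illustration, not from Anthony's run):

```python
def counts(ham_scores, spam_scores, cutoff):
    """Count false positives (ham scored at or above cutoff) and
    false negatives (spam scored below cutoff)."""
    fp = sum(1 for s in ham_scores if s >= cutoff)
    fn = sum(1 for s in spam_scores if s < cutoff)
    return fp, fn

# invented scores with a few borderline messages near 0.5
ham_scores = [0.10, 0.20, 0.51, 0.52, 0.60]
spam_scores = [0.90, 0.80, 0.51, 0.30]

print(counts(ham_scores, spam_scores, 0.500))  # (3, 1)
print(counts(ham_scores, spam_scores, 0.525))  # (1, 2): fp falls, fn rises
```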