Third result ... RE: [Spambayes] First result from Gary
Robinson's ideas
Tim Peters
tim.one@comcast.net
Fri, 20 Sep 2002 00:23:42 -0400
[Anthony Baxter]
> Let's try that again, this time with the current tokenizer.py
Not a bad idea <wink> -- thanks.
> same settings as before.
>
> false positive percentages
> 0.223 0.334 lost +49.78%
> 0.278 0.278 tied
> 0.167 0.167 tied
> 0.278 0.278 tied
> 0.334 0.389 lost +16.47%
>
> won 0 times
> tied 3 times
> lost 2 times
>
> total unique fp went from 23 to 26 lost +13.04%
> mean fp % went from 0.255790439419 to 0.289173231033 lost +13.05%
>
> false negative percentages
> 0.582 0.517 won -11.17%
> 0.388 0.388 tied
> 0.581 0.516 won -11.19%
> 0.518 0.453 won -12.55%
> 0.712 0.712 tied
>
> won 3 times
> tied 2 times
> lost 0 times
>
> total unique fn went from 43 to 40 won -6.98%
> mean fn % went from 0.556199665243 to 0.517406493303 won -6.97%
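[The "won/lost" deltas in the results above are just the signed relative change between the before and after error rates. A minimal sketch, reproducing two of the reported figures; the function name is illustrative, not from the spambayes source:]

```python
def relative_change(before: float, after: float) -> float:
    """Signed percent change from `before` to `after`."""
    return (after - before) / before * 100.0

# Reproduce two figures from the report:
fp_delta = relative_change(0.255790439419, 0.289173231033)  # mean fp %
fn_delta = relative_change(0.556199665243, 0.517406493303)  # mean fn %

print(f"mean fp % change: {fp_delta:+.2f}%")  # +13.05% ("lost")
print(f"mean fn % change: {fn_delta:+.2f}%")  # -6.97% ("won")
```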
There isn't a compelling case for saying this is significantly different.
But, as before, it seems more eager to call things spam, and again shifting
spam_cutoff from 0.50 to 0.525 would tell a seemingly different story (fp
would *fall* from 23 to 19, fn would *rise* from 43 to 44). (BTW, I did
those hastily <wink> -- double-checking my claims against your "after"
histograms would be a good exercise.) It could be that, for this kind of
especially challenging corpus, raising spam_cutoff is a good idea for good
reasons (given the still-partial implementation of Gary's suggestions -- in
the time I can make for this, I'd rather push the rest of Gary's ideas along
than try to tune the part we've already done).
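[For readers unfamiliar with the knob being discussed: each message gets a combined score in [0, 1], and anything at or above spam_cutoff is called spam, so raising the cutoff trades false positives for false negatives. A minimal sketch of that mechanism; the scores below are made up for illustration, not from Anthony's corpus:]

```python
def classify(score: float, spam_cutoff: float) -> str:
    """Call a message spam iff its combined score reaches the cutoff."""
    return "spam" if score >= spam_cutoff else "ham"

# Hypothetical scores for messages whose true label is ham:
ham_scores = [0.10, 0.48, 0.51, 0.60]
# Hypothetical scores for messages whose true label is spam:
spam_scores = [0.30, 0.52, 0.70, 0.90]

for cutoff in (0.50, 0.525):
    fp = sum(classify(s, cutoff) == "spam" for s in ham_scores)
    fn = sum(classify(s, cutoff) == "ham" for s in spam_scores)
    print(f"cutoff={cutoff}: fp={fp} fn={fn}")
```

With these toy scores, moving the cutoff from 0.50 to 0.525 drops fp from 2 to 1 while fn rises from 1 to 2, the same direction of trade-off Tim describes for the real data.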