[Spambayes-checkins] spambayes tokenizer.py,1.16,1.17

Anthony Baxter anthony@interlink.com.au
Thu, 12 Sep 2002 17:13:20 +1000


>>> "Tim Peters" wrote
> Modified Files:
> 	tokenizer.py 
> Log Message:
> Added code to strip uuencoded sections.  As reported on the mailing list,
> this has no effect on my results, except that one spam is now judged as
> ham by all the other training sets.  It shrinks the database size by a
> few percent, so that makes it a tiny win.  If Anthony Baxter doesn't
> report better results on his data, I'll be sorely tempted to throw this
> out again.
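
For readers who want a picture of what "strip uuencoded sections" could mean, here is a minimal sketch. It is not the code checked into tokenizer.py; the pattern, the function name strip_uuencoded, and the choice to return the attachment filenames are all assumptions made for illustration.

```python
import re

# Hypothetical pattern, not taken from tokenizer.py: a "begin <mode> <name>"
# header line, any number of body lines, then a line holding only "end".
UU_SECTION = re.compile(
    r"^begin [0-7]{3,4} (\S+)\n"   # header: octal mode and filename
    r"(?:.*\n)*?"                  # body lines, matched lazily
    r"end[ \t]*$\n?",              # terminator line
    re.MULTILINE,
)


def strip_uuencoded(text):
    """Return (text without uuencoded sections, list of their filenames)."""
    filenames = []

    def _drop(match):
        # Remember the filename from the header, then delete the section.
        filenames.append(match.group(1))
        return ""

    return UU_SECTION.sub(_drop, text), filenames


sample = "Hello\nbegin 644 test.txt\nM5&AI<R!I<R!A\n`\nend\nBye\n"
stripped, names = strip_uuencoded(sample)
```

Dropping the body keeps long runs of encoded characters from producing tokens, which would fit the reported shrink in database size.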

I'd say nuke it:

anthony_tok1.16s -> anthony_tok1.17s

false positive percentages
    0.778  0.778  tied
    0.834  0.778  won     -6.71%
    0.890  0.890  tied
    0.667  0.611  won     -8.40%
    1.112  1.112  tied
    0.834  0.834  tied
    0.723  0.723  tied
    0.667  0.611  won     -8.40%
    1.167  1.167  tied
    1.001  1.001  tied
    0.779  0.779  tied
    0.667  0.611  won     -8.40%
    0.778  0.778  tied
    0.778  0.778  tied
    0.556  0.556  tied
    0.778  0.723  won     -7.07%
    0.611  0.611  tied
    0.778  0.778  tied
    0.723  0.723  tied
    0.667  0.667  tied

won   5 times
tied 15 times
lost  0 times

total unique fp went from 143 to 141 won     -1.40%

false negative percentages
    0.646  0.646  tied
    0.904  0.904  tied
    0.517  0.581  lost   +12.38%
    1.229  1.229  tied
    0.840  0.840  tied
    1.033  1.033  tied
    0.711  0.775  lost    +9.00%
    1.164  1.164  tied
    0.646  0.646  tied
    0.711  0.711  tied
    0.646  0.711  lost   +10.06%
    0.517  0.517  tied
    0.776  0.776  tied
    0.646  0.646  tied
    0.904  0.904  tied
    1.035  1.035  tied
    0.582  0.582  tied
    0.581  0.581  tied
    0.775  0.775  tied
    0.646  0.646  tied

won   0 times
tied 17 times
lost  3 times
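
The percentages in the won/lost columns above appear to be the relative change from the first rate (tokenizer 1.16) to the second (1.17). The sketch below reproduces that arithmetic; the function names are made up for illustration and are not taken from the comparison script that produced these tables.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a drop)."""
    return (after - before) / before * 100.0


def verdict(before, after):
    """Lower error rates are better, so a decrease counts as a win."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# Rows copied from the tables above: (rate with 1.16, rate with 1.17).
fp_row = (0.834, 0.778)
fn_row = (0.517, 0.581)

fp_change = round(relative_change(*fp_row), 2)   # matches the -6.71% row
fn_change = round(relative_change(*fn_row), 2)   # matches the +12.38% row
```

The same formula gives the -1.40% figure for total unique false positives going from 143 to 141.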