[Spambayes-checkins] spambayes tokenizer.py,1.16,1.17

Tim Peters tim.one@comcast.net
Thu, 12 Sep 2002 10:47:47 -0400


[Tim]
>> Modified Files:
>> 	tokenizer.py
>> Log Message:
>> Added code to strip uuencoded sections.  As reported on the mailing list,
>> this has no effect on my results, except that one spam is now judged as
>> ham by all the other training sets.  It shrinks the database size by a
>> few percent, so that makes it a tiny win.  If Anthony Baxter doesn't
>> report better results on his data, I'll be sorely tempted to throw this
>> out again.

[Anthony Baxter]
> I'd say nuke it:
>
> false positive percentages
>     0.778  0.778  tied
>     0.834  0.778  won     -6.71%
>     0.890  0.890  tied
>     0.667  0.611  won     -8.40%
>     1.112  1.112  tied
>     0.834  0.834  tied
>     0.723  0.723  tied
>     0.667  0.611  won     -8.40%
>     1.167  1.167  tied
>     1.001  1.001  tied
>     0.779  0.779  tied
>     0.667  0.611  won     -8.40%
>     0.778  0.778  tied
>     0.778  0.778  tied
>     0.556  0.556  tied
>     0.778  0.723  won     -7.07%
>     0.611  0.611  tied
>     0.778  0.778  tied
>     0.723  0.723  tied
>     0.667  0.667  tied
>
> won   5 times
> tied 15 times
> lost  0 times
>
> total unique fp went from 143 to 141 won     -1.40%
>
> false negative percentages
>     0.646  0.646  tied
>     0.904  0.904  tied
>     0.517  0.581  lost   +12.38%
>     1.229  1.229  tied
>     0.840  0.840  tied
>     1.033  1.033  tied
>     0.711  0.775  lost    +9.00%
>     1.164  1.164  tied
>     0.646  0.646  tied
>     0.711  0.711  tied
>     0.646  0.711  lost   +10.06%
>     0.517  0.517  tied
>     0.776  0.776  tied
>     0.646  0.646  tied
>     0.904  0.904  tied
>     1.035  1.035  tied
>     0.582  0.582  tied
>     0.581  0.581  tied
>     0.775  0.775  tied
>     0.646  0.646  tied
>
> won   0 times
> tied 17 times
> lost  3 times
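
For readers checking the arithmetic in the table above: the percentage column is consistent with a plain relative change, (after - before) / before, where a lower error rate counts as a win. The sketch below reproduces a few rows under that assumption. The `compare` helper is a hypothetical name for illustration, not the comparison script that produced the quoted output.

```python
def compare(before, after):
    """Classify one before/after pair of error rates (lower is better).

    Returns (verdict, percent_change). Assumes the percentage is the
    relative change with respect to the 'before' value.
    """
    if after == before:
        return "tied", 0.0
    change = (after - before) / before * 100.0
    verdict = "won" if after < before else "lost"
    return verdict, change


# A few rows copied from the tables above, plus the unique-fp totals.
pairs = [(0.778, 0.778), (0.834, 0.778), (0.517, 0.581), (143, 141)]

for before, after in pairs:
    verdict, change = compare(before, after)
    print(f"{before:>8}  {after:>8}  {verdict:<5} {change:+.2f}%")
```

Identical rates are reported as tied, so only rows that actually moved carry a percentage.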

So one spam in your Set4 now gets through when scored by Sets 1, 2 or 3
(the three lost false-negative runs), but two hams are no longer called spam
by any training set (total unique fp went from 143 to 141). That's a small
win, so I'm inclined to leave it in after all (it's a cheap transformation,
and it keeps a bunch of useless "skip" tokens out of the database).
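
To illustrate the kind of transformation being discussed, here is a minimal sketch of stripping uuencoded sections from a message body before tokenizing. It is not the code checked in to tokenizer.py: the pattern `UU_SECTION` and the function `strip_uuencoded` are hypothetical names, and the regular expression assumes the conventional layout of a `begin <mode> <name>` line, body lines, and a closing `end` line.

```python
import re

# Hypothetical pattern (an assumption, not taken from tokenizer.py):
# a "begin <mode> <name>" header, any number of body lines, then "end".
UU_SECTION = re.compile(
    r"""
    ^begin [ \t]+ (?P<mode> [0-7]{3,4}) [ \t]+ (?P<name> [^\n]+) \n  # header
    (?: .* \n )*?                                                    # body
    ^end [ \t]* $                                                    # trailer
    """,
    re.VERBOSE | re.MULTILINE,
)


def strip_uuencoded(text):
    """Remove uuencoded sections from text.

    Returns (stripped_text, names), where names lists the file names
    found on the removed "begin" lines, in order of appearance.
    """
    names = []

    def note_and_drop(match):
        names.append(match.group("name").strip())
        return ""

    return UU_SECTION.sub(note_and_drop, text), names


body = (
    "Hello\n"
    "begin 644 pic.jpg\n"
    "M_]C_X``02D9)1@`!\n"
    "`\n"
    "end\n"
    "Bye\n"
)
stripped, names = strip_uuencoded(body)
print(repr(stripped))
print(names)
```

Returning the file names alongside the stripped text leaves the caller free to use them or ignore them; the encoded body lines, which carry no readable words, are simply dropped.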