[Spambayes-checkins] spambayes tokenizer.py,1.16,1.17
Tim Peters
tim.one@comcast.net
Thu, 12 Sep 2002 10:47:47 -0400
[Tim]
>> Modified Files:
>> tokenizer.py
>> Log Message:
>> Added code to strip uuencoded sections. As reported on the mailing list,
>> this has no effect on my results, except that one spam is now judged as
>> ham by all the other training sets. It shrinks the database size by a
>> few percent, so that makes it a tiny win. If Anthony Baxter doesn't
>> report better results on his data, I'll be sorely tempted to throw this
>> out again.
[Anthony Baxter]
> I'd say nuke it:
>
> false positive percentages
> 0.778 0.778 tied
> 0.834 0.778 won -6.71%
> 0.890 0.890 tied
> 0.667 0.611 won -8.40%
> 1.112 1.112 tied
> 0.834 0.834 tied
> 0.723 0.723 tied
> 0.667 0.611 won -8.40%
> 1.167 1.167 tied
> 1.001 1.001 tied
> 0.779 0.779 tied
> 0.667 0.611 won -8.40%
> 0.778 0.778 tied
> 0.778 0.778 tied
> 0.556 0.556 tied
> 0.778 0.723 won -7.07%
> 0.611 0.611 tied
> 0.778 0.778 tied
> 0.723 0.723 tied
> 0.667 0.667 tied
>
> won 5 times
> tied 15 times
> lost 0 times
>
> total unique fp went from 143 to 141 won -1.40%
>
> false negative percentages
> 0.646 0.646 tied
> 0.904 0.904 tied
> 0.517 0.581 lost +12.38%
> 1.229 1.229 tied
> 0.840 0.840 tied
> 1.033 1.033 tied
> 0.711 0.775 lost +9.00%
> 1.164 1.164 tied
> 0.646 0.646 tied
> 0.711 0.711 tied
> 0.646 0.711 lost +10.06%
> 0.517 0.517 tied
> 0.776 0.776 tied
> 0.646 0.646 tied
> 0.904 0.904 tied
> 1.035 1.035 tied
> 0.582 0.582 tied
> 0.581 0.581 tied
> 0.775 0.775 tied
> 0.646 0.646 tied
>
> won 0 times
> tied 17 times
> lost 3 times
So there's one spam in your Set4 that gets through when scored by Sets 1, 2
or 3 now, but two hams that are no longer called spam by any training set.
That's a small win, so I'm inclined to leave it in after all (it's a cheap
transformation, and keeps a bunch of useless "skip" tokens out of the
database).
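
For readers who want to see the kind of transformation being discussed, here is a minimal sketch of stripping uuencoded sections from message text before tokenizing. It is not the code checked in to tokenizer.py; the names UU_SECTION_RE and strip_uuencoded are hypothetical, and it assumes a section is a "begin <octal mode> <filename>" line, some encoded lines, and a line consisting of "end".

```python
import re

# Hypothetical pattern (an assumption, not taken from tokenizer.py):
# a "begin <mode> <filename>" header line, any number of body lines,
# and a terminating line that consists of "end".
UU_SECTION_RE = re.compile(
    r"^begin[ \t]+([0-7]{3,4})[ \t]+([^\r\n]+)\r?\n"  # header line
    r"(?:[^\r\n]*\r?\n)*?"                            # body lines, as few as possible
    r"end[ \t]*(?:\r?\n|\Z)",                         # terminator line
    re.MULTILINE,
)


def strip_uuencoded(text):
    """Return (text without uuencoded sections, list of their filenames)."""
    filenames = []

    def _drop(match):
        # Remember the filename from the header, discard the whole section.
        filenames.append(match.group(2).strip())
        return ""

    return UU_SECTION_RE.sub(_drop, text), filenames


if __name__ == "__main__":
    msg = "Hi there\nbegin 644 cat.jpg\nM_]C_X``02D9)1@`!\n`\nend\nBye\n"
    cleaned, names = strip_uuencoded(msg)
    print(repr(cleaned))
    print(names)
```

Returning the filenames separately leaves it up to the caller whether to turn them into tokens or ignore them; either way the encoded body never reaches the tokenizer, so it cannot generate "skip" tokens.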