[Spambayes] A couple of small tokenizer experiments.
Anthony Baxter
anthony@interlink.com.au
Tue Nov 12 07:13:56 2002
>>> Tim Peters wrote
> We don't tokenize To: now because it gives good results for bad reasons on
> mixed-source corpora. It would be good to have an option to tokenize it.
> It appears that your code also tokenized Cc:; also fine. I would rather see
> the code added to the loop currently cracking "from" lines:
>
> for field in ('from',):
>
> so that we tokenize all address thingies in a uniform way. The option would
> control the list of field names looped over there (default just from:,
> optionally also to: and cc:).
I've added this now. For me, tokenising just the 'from' line
with the new 'address_headers' option gives (vs the old code):
(all tests with 4 sets of 1200H/400S)
filename:    old_from      new_from
ham:spam:   4800:1600     4800:1600
fp total:           1             1
fp %:            0.02          0.02
fn total:          12            11
fn %:            0.75          0.69
unsure t:          86            88
unsure %:        1.34          1.38
real cost:     $39.20        $38.60
best cost:     $31.80        $32.40
h mean:          0.36          0.36
h sdev:          4.04          4.05
s mean:         98.25         98.25
s sdev:          8.93          8.99
mean diff:      97.89         97.89
k:               7.55          7.51
The old code's best cost was:
-> achieved at ham & spam cutoffs 0.24 & 0.99
-> fp 0; fn 3; unsure ham 26; unsure spam 118
-> fp rate 0%; fn rate 0.188%; unsure rate 2.25%
The new code's best cost was:
-> largest ham & spam cutoffs 0.26 & 0.99
-> fp 0; fn 4; unsure ham 24; unsure spam 118
-> fp rate 0%; fn rate 0.25%; unsure rate 2.22%
The one additional fn was a spam that was dragged from 0.35 to
0.21 because it came from 'update@localhost.net' - the 'update'
was a strong spam clue.
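
To illustrate why that move turns an unsure into a false negative, here is a
minimal sketch of three-way classification by cutoffs. The function name and
the exact boundary comparisons are assumptions for illustration, not the
project's code; the scores and cutoffs are the ones quoted above.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way verdict from a score in [0, 1].

    Hypothetical helper: whether the real comparisons are strict or
    inclusive at the boundaries is an assumption here.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# The spam discussed above, under each run's best-cost cutoffs.
old_verdict = classify(0.35, ham_cutoff=0.24, spam_cutoff=0.99)
new_verdict = classify(0.21, ham_cutoff=0.26, spam_cutoff=0.99)
print(old_verdict, new_verdict)
```

A spam that lands in "ham" is counted as an fn, which is how one message
accounts for the 3 -> 4 change.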
Where it gets more interesting is when I also tokenize to and cc:
filename:    new_from  new_fromtocc
ham:spam:   4800:1600     4800:1600
fp total:           1             1
fp %:            0.02          0.02
fn total:           4             5
fn %:            0.25          0.31
unsure t:         121           104
unsure %:        1.89          1.62
real cost:     $38.20        $35.80
best cost:     $32.40        $28.00
h mean:          0.36          0.31
h sdev:          4.05          3.80
s mean:         98.25         98.42
s sdev:          8.99          8.77
mean diff:      97.89         98.11
k:               7.51          7.81
We go from:
-> largest ham & spam cutoffs 0.26 & 0.99
-> fp 0; fn 4; unsure ham 24; unsure spam 118
-> fp rate 0%; fn rate 0.25%; unsure rate 2.22%
to
-> largest ham & spam cutoffs 0.22 & 0.99
-> fp 0; fn 3; unsure ham 25; unsure spam 100
-> fp rate 0%; fn rate 0.188%; unsure rate 1.95%
That's a total of 142->125 unsures. I'll accept that :)
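
The cost figures above are consistent with charging $10 per fp, $1 per fn
and $0.20 per unsure. Those weights are inferred from the numbers in this
post rather than stated in it, so treat the following as a sketch with a
hypothetical helper name.

```python
def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost in dollars. The default weights are an assumption
    inferred from the figures quoted in this message."""
    return round(fp * fp_weight + fn * fn_weight + unsure * unsure_weight, 2)


# Best-cost lines quoted above: unsure = unsure ham + unsure spam.
cost_from_only = total_cost(fp=0, fn=4, unsure=24 + 118)
cost_from_to_cc = total_cost(fp=0, fn=3, unsure=25 + 100)
print("$%.2f -> $%.2f" % (cost_from_only, cost_from_to_cc))
```

Under those weights the drop of 17 unsures is worth $3.40, and the one fewer
fn another $1.00.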
Just to make sure, I ran it again with a different seed:
filename:   new_from2 new_fromtocc2
ham:spam:   4800:1600     4800:1600
fp total:           0             0
fp %:            0.00          0.00
fn total:           6             6
fn %:            0.38          0.38
unsure t:         110            97
unsure %:        1.72          1.52
real cost:     $28.00        $25.40
best cost:     $23.00        $19.20
h mean:          0.45          0.39
h sdev:          4.72          4.48
s mean:         98.44         98.56
s sdev:          8.82          8.62
mean diff:      97.99         98.17
k:               7.24          7.49
That went from:
-> largest ham & spam cutoffs 0.28 & 0.94
-> fp 0; fn 6; unsure ham 23; unsure spam 62
-> fp rate 0%; fn rate 0.375%; unsure rate 1.33%
to
-> largest ham & spam cutoffs 0.24 & 0.93
-> fp 0; fn 4; unsure ham 25; unsure spam 51
-> fp rate 0%; fn rate 0.25%; unsure rate 1.19%
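
The last rows of the summary tables appear to be related: 'mean diff' matches
s mean minus h mean, and k matches mean diff divided by the sum of the two
sdevs. That relationship is inferred from the figures in this message, not
quoted from the test code, so the helper below is only a sketch with a
hypothetical name.

```python
def separation(h_mean, h_sdev, s_mean, s_sdev):
    """Return (mean diff, k) for a pair of score distributions.

    Assumed formulas, checked only against the tables in this message:
    mean diff = s_mean - h_mean, k = mean diff / (h_sdev + s_sdev).
    """
    mean_diff = s_mean - h_mean
    k = mean_diff / (h_sdev + s_sdev)
    return round(mean_diff, 2), round(k, 2)


# Second-seed run: from only, then from + to + cc.
print(separation(0.45, 4.72, 98.44, 8.82))
print(separation(0.39, 4.48, 98.56, 8.62))
```

A larger k means the ham and spam score populations sit further apart
relative to their spread.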
toemail:python.org and toemail:zope.org both show up in
my 'best discriminators' list as _very_ strong ham clues
(not surprising, given the mailing lists I'm on). My old/uncommon
email addresses generally show up as strong spam clues
(eg prob('toemail:arb') = 0.999356).
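
For anyone wondering where tokens like those come from, here is a minimal
sketch of looping over a configurable list of address headers. It is not the
checked-in tokenizer code: the function name, the 'address_headers' argument
shape and the exact '<field>email:<part>' token format are assumptions based
on the tokens quoted above. It uses only the standard library's email package.

```python
from email import message_from_string
from email.utils import getaddresses


def address_tokens(msg, address_headers=("from",)):
    """Yield one token per local part and per domain of every address
    found in the named headers (hypothetical token format)."""
    for field in address_headers:
        # get_all() is case-insensitive and returns every matching header.
        values = msg.get_all(field, [])
        for _name, addr in getaddresses(values):
            for part in addr.lower().split("@"):
                if part:
                    yield "%semail:%s" % (field, part)


sample = message_from_string(
    "From: Update <update@localhost.net>\n"
    "To: python-list@python.org\n"
    "Cc: arb@example.com\n"
    "\n"
    "body\n"
)
tokens = list(address_tokens(sample, ("from", "to", "cc")))
print(tokens)
```

With the default of just ("from",), only the From: header contributes tokens,
which mirrors the default described in the quoted text.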
Next, I tried it against a chunk of my horrible corpus - 4 (out of
10) sets of 1200H/400S (out of 3500H/1800S in each set)
filename:   info_from info_fromtocc
ham:spam:   4800:1600     4800:1600
fp total:           6             7
fp %:            0.12          0.15
fn total:           4             4
fn %:            0.25          0.25
unsure t:         208           179
unsure %:        3.25          2.80
real cost:    $105.60       $109.80
best cost:     $78.00        $66.40
h mean:          3.05          2.63
h sdev:         10.88         10.12
s mean:         99.17         99.12
s sdev:          6.65          6.99
mean diff:      96.12         96.49
k:               5.48          5.64
That's
-> achieved at ham & spam cutoffs 0.62 & 0.99
-> fp 5; fn 11; unsure ham 44; unsure spam 41
-> fp rate 0.104%; fn rate 0.688%; unsure rate 1.33%
going to
-> achieved at ham & spam cutoffs 0.62 & 0.99
-> fp 4; fn 12; unsure ham 36; unsure spam 36
-> fp rate 0.0833%; fn rate 0.75%; unsure rate 1.12%
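
The rates on those lines follow from the raw counts: fp over the number of
hams, fn over the number of spams, and unsures over all messages. A small
sketch with a hypothetical helper name, using the 4800:1600 split from the
table:

```python
def rates(fp, fn, unsure_ham, unsure_spam, n_ham, n_spam):
    """Return (fp %, fn %, unsure %) from raw counts."""
    fp_rate = 100.0 * fp / n_ham
    fn_rate = 100.0 * fn / n_spam
    unsure_rate = 100.0 * (unsure_ham + unsure_spam) / (n_ham + n_spam)
    return fp_rate, fn_rate, unsure_rate


from_only = rates(5, 11, 44, 41, n_ham=4800, n_spam=1600)
from_to_cc = rates(4, 12, 36, 36, n_ham=4800, n_spam=1600)
print("%.3g%% %.3g%% %.3g%%" % from_only)
print("%.3g%% %.3g%% %.3g%%" % from_to_cc)
```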
Anyway, the option's checked in and there, so go play. I'll run a full
test of the horror corpus overnight...
Anthony