[Spambayes] A couple of small tokenizer experiments.

Tue Nov 12 07:13:56 2002

>>> Tim Peters wrote
> We don't tokenize To: now because it gives good results for bad reasons on
> mixed-source corpora.  It would be good to have an option to tokenize it.
> It appears that your code also tokenized Cc:; also fine.  I would rather see
> the code added to the loop currently cracking "from" lines:
> 
>         for field in ('from',):
> 
> so that we tokenize all address thingies in a uniform way.  The option would
> control the list of field names looped over there (default just from:,
> optionally also to: and cc:).

I've added this now. For me, tokenising just the 'from' line
with the new 'address_headers' option gives (vs the old code):

(all tests with 4 sets of 1200H/400S)

filename:   old_from   new_from
ham:spam:  4800:1600  4800:1600
fp total:        1       1
fp %:         0.02    0.02
fn total:       12      11
fn %:         0.75    0.69
unsure t:       86      88
unsure %:     1.34    1.38
real cost:  $39.20  $38.60
best cost:  $31.80  $32.40
h mean:       0.36    0.36
h sdev:       4.04    4.05
s mean:      98.25   98.25
s sdev:       8.93    8.99
mean diff:   97.89   97.89
k:            7.55    7.51
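
The last two rows look derived from the four above them: 'mean diff' matches s mean minus h mean, and 'k' matches that difference divided by the sum of the two sdevs. A minimal Python sketch, assuming those definitions (they are inferred from the figures in the table, and the function name is made up for illustration):

```python
def separation(h_mean, h_sdev, s_mean, s_sdev):
    """Return (mean diff, k) under the assumed definitions."""
    mean_diff = s_mean - h_mean          # gap between spam and ham mean scores
    k = mean_diff / (h_sdev + s_sdev)    # gap measured in combined sdevs
    return mean_diff, k

# old_from column of the table above
mean_diff, k = separation(0.36, 4.04, 98.25, 8.93)
print(f"mean diff: {mean_diff:.2f}   k: {k:.2f}")
```

A larger k means the ham and spam score populations are further apart relative to their spread.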

The old code's best cost was:
-> achieved at ham & spam cutoffs 0.24 & 0.99
->     fp 0; fn 3; unsure ham 26; unsure spam 118
->     fp rate 0%; fn rate 0.188%; unsure rate 2.25%

The new code's best cost was:
-> largest ham & spam cutoffs 0.26 & 0.99
->     fp 0; fn 4; unsure ham 24; unsure spam 118
->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%
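
The cost figures are consistent with a weighting of $10 per fp, $1 per fn and $0.20 per unsure. Those weights are not stated in this message; they are inferred from the numbers, so treat the sketch below as an assumption that happens to reproduce them (the function name is illustrative):

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted cost of a run; the default weights are inferred, not quoted."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# 'real cost' for old_from: fp 1, fn 12, unsure 86
print(f"${total_cost(1, 12, 86):.2f}")       # $39.20
# 'best cost' for old_from: fp 0, fn 3, unsure ham 26 + unsure spam 118
print(f"${total_cost(0, 3, 26 + 118):.2f}")  # $31.80
```

With these weights a single fp costs as much as 50 unsures, which is why the best-cost cutoffs trade away unsures to get fp down to 0.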

The one additional fn was a spam that was dragged from 0.35 to
0.21 because it came from 'update@localhost.net' - the 'update'
was a strong spam clue.
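
To see why that move matters, here is a rough sketch of the three-way decision the cutoffs imply. The function name and the handling of scores exactly on a cutoff are assumptions, not taken from the test driver:

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision: below the ham cutoff is ham, at or above
    the spam cutoff is spam, anything in between is unsure."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

# The spam in question, judged at the new code's best-cost cutoffs
print(classify(0.35, 0.26, 0.99))  # unsure
print(classify(0.21, 0.26, 0.99))  # ham, i.e. a false negative
```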

Where it gets more interesting is when I also tokenize to and cc:

filename:   new_from  new_fromtocc
ham:spam:  4800:1600     4800:1600
fp total:        1       1
fp %:         0.02    0.02
fn total:        4       5
fn %:         0.25    0.31
unsure t:      121     104
unsure %:     1.89    1.62
real cost:  $38.20  $35.80
best cost:  $32.40  $28.00
h mean:       0.36    0.31
h sdev:       4.05    3.80
s mean:      98.25   98.42
s sdev:       8.99    8.77
mean diff:   97.89   98.11
k:            7.51    7.81

We go from:
-> largest ham & spam cutoffs 0.26 & 0.99
->     fp 0; fn 4; unsure ham 24; unsure spam 118
->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%

to
-> largest ham & spam cutoffs 0.22 & 0.99
->     fp 0; fn 3; unsure ham 25; unsure spam 100
->     fp rate 0%; fn rate 0.188%; unsure rate 1.95%

That's a total of 142->125 unsures. I'll accept that :)
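
The rate lines follow from the counts if fp is taken as a fraction of the ham, fn as a fraction of the spam, and unsure as a fraction of all messages. A small Python sketch under that assumption (the function name is made up; the 4800:1600 defaults are the ham:spam totals from the tables):

```python
def rates(fp, fn, unsure_ham, unsure_spam, n_ham=4800, n_spam=1600):
    """Return (fp %, fn %, unsure %) for one run."""
    fp_rate = 100.0 * fp / n_ham
    fn_rate = 100.0 * fn / n_spam
    unsure_rate = 100.0 * (unsure_ham + unsure_spam) / (n_ham + n_spam)
    return fp_rate, fn_rate, unsure_rate

# from-only vs from+to+cc at their best-cost cutoffs
for label, counts in (("from", (0, 4, 24, 118)), ("from+to+cc", (0, 3, 25, 100))):
    fp_rate, fn_rate, unsure_rate = rates(*counts)
    print(f"{label}: fp {fp_rate:.3g}%  fn {fn_rate:.3g}%  unsure {unsure_rate:.3g}%")
```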

Just to make sure, I ran it again with a different seed:

filename:  new_from2  new_fromtocc2
ham:spam:  4800:1600      4800:1600
fp total:        0       0
fp %:         0.00    0.00
fn total:        6       6
fn %:         0.38    0.38
unsure t:      110      97
unsure %:     1.72    1.52
real cost:  $28.00  $25.40
best cost:  $23.00  $19.20
h mean:       0.45    0.39
h sdev:       4.72    4.48
s mean:      98.44   98.56
s sdev:       8.82    8.62
mean diff:   97.99   98.17
k:            7.24    7.49

That went from:
-> largest ham & spam cutoffs 0.28 & 0.94
->     fp 0; fn 6; unsure ham 23; unsure spam 62
->     fp rate 0%; fn rate 0.375%; unsure rate 1.33%
to
-> largest ham & spam cutoffs 0.24 & 0.93
->     fp 0; fn 4; unsure ham 25; unsure spam 51
->     fp rate 0%; fn rate 0.25%; unsure rate 1.19%

toemail:python.org and toemail:zope.org both show up in
my 'best discriminators' list as _very_ strong ham clues
(not surprising, given the mailing lists I'm on). My old/uncommon
email addresses generally show up as very strong spam clues
(e.g. prob('toemail:arb') = 0.999356).
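
For anyone wondering where clues like these come from, here is a rough standalone sketch of looping over a configurable list of address headers and emitting one token per address part. It is not the checked-in tokenizer: the token format is guessed from the clue names quoted above, and the function and parameter names are made up for illustration.

```python
from email import message_from_string
from email.utils import getaddresses

def tokenize_address_headers(msg, address_headers=("from",)):
    """Yield hypothetical '<field>email:<part>' tokens for each address
    found in the listed header fields (default: just From:)."""
    for field in address_headers:
        values = msg.get_all(field, [])
        for _name, addr in getaddresses(values):
            if not addr:
                continue
            local, _, domain = addr.lower().rpartition("@")
            if local:
                yield f"{field}email:{local}"
            if domain:
                yield f"{field}email:{domain}"

raw = (
    "From: Someone <someone@example.com>\n"
    "To: python-list@python.org\n"
    "Cc: zope@zope.org\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
print(list(tokenize_address_headers(msg)))                        # From: only
print(list(tokenize_address_headers(msg, ("from", "to", "cc"))))  # all three
```

Keeping the field name in the token is what lets the classifier treat python.org in To: differently from python.org in From:.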

Next, I tried it against a chunk of my horrible corpus: 4 (out of
10) sets of 1200H/400S (out of 3500H/1800S in each set).

filename:  info_from  info_fromtocc
ham:spam:  4800:1600      4800:1600
fp total:        6       7
fp %:         0.12    0.15
fn total:        4       4
fn %:         0.25    0.25
unsure t:      208     179
unsure %:     3.25    2.80
real cost: $105.60 $109.80
best cost:  $78.00  $66.40
h mean:       3.05    2.63
h sdev:      10.88   10.12
s mean:      99.17   99.12
s sdev:       6.65    6.99
mean diff:   96.12   96.49
k:            5.48    5.64

For best cost, that's
-> achieved at ham & spam cutoffs 0.62 & 0.99
->     fp 5; fn 11; unsure ham 44; unsure spam 41
->     fp rate 0.104%; fn rate 0.688%; unsure rate 1.33%

going to
-> achieved at ham & spam cutoffs 0.62 & 0.99
->     fp 4; fn 12; unsure ham 36; unsure spam 36
->     fp rate 0.0833%; fn rate 0.75%; unsure rate 1.12%

Anyway, the option's checked in and there, so go play. I'll run a full
test of the horror corpus overnight...

Anthony