[spambayes-dev] spammy subject lines
Tony Meyer
tameyer at ihug.co.nz
Mon Oct 13 19:54:41 EDT 2003
>> fp total: 1 1
>> fp %: 0.30 0.30
>> fn total: 165 162
>> fn %: 3.76 3.69
>> unsure t: 528 526
>> unsure %: 11.17 11.13
> That's an extraordinarily high unsure %. Do you normally see
> such a high rate? The FN rate also seems high.
That's not that far off the normal FN rate I get (usually about 2.5% IIRC),
although I attribute a reasonable amount of that to my ham:spam imbalance.
That is much higher than the unsure rate I usually get (I wasn't paying
enough attention, or I would have noticed that). I reran it with timtest.py
instead of timcv.py, with n=5, and with balanced data, and got:
filename:    std_octs  std_subjs
ham:spam:   1320:1320  1320:1320
fp total: 0 0
fp %: 0.00 0.00
fn total: 73 74
fn %: 5.53 5.61
unsure t: 91 93
unsure %: 3.45 3.52
real cost: $91.20 $92.60
best cost: $205.00 $208.60
h mean: 0.92 0.97
h sdev: 5.54 5.73
s mean: 78.52 78.61
s sdev: 37.02 36.97
mean diff: 77.60 77.64
k: 1.82 1.82
This is only with a very small set of data, though (tests of 66 against 66).
Would you say that this is too small a dataset to give valid results? The FN
rate has risen even higher, although it's (almost) the same for both, and the
unsures are back to what they normally are.
So I pulled out some more ham and tried with that (again timtest.py -n5):
filename:       lg_octs  lg_oct_subjs
ham:spam:    9248:17592    9248:17592
fp total: 3 4
fp %: 0.03 0.04
fn total: 161 163
fn %: 0.92 0.93
unsure t: 450 448
unsure %: 1.68 1.67
real cost: $281.00 $292.60
best cost: $538.60 $555.40
h mean: 0.69 0.71
h sdev: 5.75 5.84
s mean: 95.45 95.50
s sdev: 17.47 17.37
mean diff: 94.76 94.79
k: 4.08 4.08
These numbers look nicer :) Although not a win for the change...
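As an aside, the "real cost" lines appear to follow the driver's cost
weighting of $10 per fp, $1 per fn, and $0.20 per unsure. A minimal sketch
that reproduces the lg_octs figure above (the weights are inferred from the
numbers in these tables, so treat them as assumptions, not gospel):

```python
def real_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Cost weighting inferred from the tables above: fp $10, fn $1, unsure $0.20."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# lg_octs column above: 3 fp, 161 fn, 450 unsure.
print(round(real_cost(fp=3, fn=161, unsure=450), 2))  # 281.0
# std_octs column above: 0 fp, 73 fn, 91 unsure.
print(round(real_cost(fp=0, fn=73, unsure=91), 2))    # 91.2
```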
> > real cost: $280.60 $277.20
> > best cost: $136.20 $134.00
>
> Suggests that the cutoffs are far from optimal. Score
> distribution histograms would reveal more.
Is this what mkgraph.py does? I've never managed to figure out how to use
that...
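For anyone following along: the unsure counts come from a three-way decision
against two score cutoffs. A minimal sketch, using illustrative cutoff values
on the 0-100 scale the stats above use (not necessarily the values these runs
were configured with):

```python
# Illustrative cutoffs only; moving these is what changes the unsure rate.
HAM_CUTOFF, SPAM_CUTOFF = 20.0, 90.0

def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Three-way decision: below ham_cutoff -> ham, above spam_cutoff -> spam,
    everything in between lands in the 'unsure' bucket counted above."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

print([classify(s) for s in (2.0, 55.0, 97.0, 31.0)])
# ['ham', 'unsure', 'spam', 'unsure']
```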
> Are you also running the mixed unigram/bigram scheme?
Normally, yes; here, no (using a fresh-from-CVS copy).
> One of the points of a cross-validation run is to get several
> runs, and see how many won, lost, and tied. This very brief
> summary output hides all that stuff. So, e.g., we can't tell
> whether all 6 runs had a tiny win, or 5 lost a little and 1
> won big, adding up to a tiny overall win. The smaller the
> net effect in the end, the more important to see more details.
So when using timcv.py I should post the rates.py output rather than cmp.py
or table.py? Is the abbreviated version ok for timtest?
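To illustrate the point about per-run detail, here's a sketch of the
win/loss/tie tally that a summary total hides; the per-fold fn counts below
are made up purely for illustration:

```python
# Hypothetical fn counts per cross-validation fold for two configurations.
baseline = [28, 31, 26, 30, 25, 27]  # config A
changed  = [27, 32, 26, 29, 25, 28]  # config B

# A fold is a "win" for the change when it lowered the fn count.
won  = sum(b > c for b, c in zip(baseline, changed))
lost = sum(b < c for b, c in zip(baseline, changed))
tied = sum(b == c for b, c in zip(baseline, changed))
print(won, lost, tied)  # 2 2 2 -- a wash, despite whatever the totals say
```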
=Tony Meyer