[spambayes-dev] spammy subject lines
Tony Meyer
tameyer at ihug.co.nz
Mon Oct 13 19:54:41 EDT 2003
>> fp total: 1 1
>> fp %: 0.30 0.30
>> fn total: 165 162
>> fn %: 3.76 3.69
>> unsure t: 528 526
>> unsure %: 11.17 11.13
> That's an extraordinarily high unsure %. Do you normally see
> such a high rate? The FN rate also seems high.
That's not that far off the normal FN rate I get (usually about 2.5% IIRC),
although I attribute a reasonable amount of that to my ham:spam imbalance.
That is much higher than the unsure rate I usually get (I wasn't paying
enough attention, or I would have noticed that). I reran it with timtest.py
instead of timcv.py, with n=5, and with balanced data, and got:
filename:    std_octs  std_subjs
ham:spam:   1320:1320  1320:1320
fp total: 0 0
fp %: 0.00 0.00
fn total: 73 74
fn %: 5.53 5.61
unsure t: 91 93
unsure %: 3.45 3.52
real cost: $91.20 $92.60
best cost: $205.00 $208.60
h mean: 0.92 0.97
h sdev: 5.54 5.73
s mean: 78.52 78.61
s sdev: 37.02 36.97
mean diff: 77.60 77.64
k: 1.82 1.82
This is only with a very small set of data, though (tests of 66 against 66).
Would you say that this is too small a dataset to give valid results? The FN
rate has risen even higher, although it's (almost) the same for both, and the
unsures are back to what they normally are.
So I pulled out some more ham and tried with that (again timtest.py -n5):
filename:       lg_octs  lg_oct_subjs
ham:spam:    9248:17592    9248:17592
fp total: 3 4
fp %: 0.03 0.04
fn total: 161 163
fn %: 0.92 0.93
unsure t: 450 448
unsure %: 1.68 1.67
real cost: $281.00 $292.60
best cost: $538.60 $555.40
h mean: 0.69 0.71
h sdev: 5.75 5.84
s mean: 95.45 95.50
s sdev: 17.47 17.37
mean diff: 94.76 94.79
k: 4.08 4.08
These numbers look nicer :) Although not a win for the change...
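As an aside, the "real cost" lines appear to follow the driver's cost
weighting of $10 per fp, $1 per fn, and $0.20 per unsure. A minimal sketch
that reproduces the lg_octs figure above (the weights are inferred from the
numbers in these tables, so treat them as assumptions, not gospel):

```python
def real_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Cost weighting inferred from the tables above: fp $10, fn $1, unsure $0.20."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# lg_octs column above: 3 fp, 161 fn, 450 unsure.
print(round(real_cost(fp=3, fn=161, unsure=450), 2))  # 281.0
# std_octs column above: 0 fp, 73 fn, 91 unsure.
print(round(real_cost(fp=0, fn=73, unsure=91), 2))    # 91.2
```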
> > real cost: $280.60 $277.20
> > best cost: $136.20 $134.00
>
> Suggests that the cutoffs are far from optimal. Score
> distribution histograms would reveal more.
Is this what mkgraph.py does? I've never managed to figure out how to use
that...
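For anyone following along: the unsure counts come from a three-way decision
against two score cutoffs. A minimal sketch, using illustrative cutoff values
on the 0-100 scale the stats above use (not necessarily the values these runs
were configured with):

```python
# Illustrative cutoffs only; moving these is what changes the unsure rate.
HAM_CUTOFF, SPAM_CUTOFF = 20.0, 90.0

def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Three-way decision: below ham_cutoff -> ham, above spam_cutoff -> spam,
    everything in between lands in the 'unsure' bucket counted above."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

print([classify(s) for s in (2.0, 55.0, 97.0, 31.0)])
# ['ham', 'unsure', 'spam', 'unsure']
```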
> Are you also running the mixed unigram/bigram scheme?
Normally, yes; here, no (using a fresh-from-CVS copy).
> One of the points of a cross-validation run is to get several
> runs, and see how many won, lost, and tied. This very brief
> summary output hides all that stuff. So, e.g., we can't tell
> whether all 6 runs had a tiny win, or 5 lost a little and 1
> won big, adding up to a tiny overall win. The smaller the
> net effect in the end, the more important to see more details.
So when using timcv.py I should post the rates.py output rather than cmp.py
or table.py? Is the abbreviated version ok for timtest?
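To illustrate the point about per-run detail, here's a sketch of the
win/loss/tie tally that a summary total hides; the per-fold fn counts below
are made up purely for illustration:

```python
# Hypothetical fn counts per cross-validation fold for two configurations.
baseline = [28, 31, 26, 30, 25, 27]  # config A
changed  = [27, 32, 26, 29, 25, 28]  # config B

# A fold is a "win" for the change when it lowered the fn count.
won  = sum(b > c for b, c in zip(baseline, changed))
lost = sum(b < c for b, c in zip(baseline, changed))
tied = sum(b == c for b, c in zip(baseline, changed))
print(won, lost, tied)  # 2 2 2 -- a wash, despite whatever the totals say
```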
=Tony Meyer