[spambayes-dev] spammy subject lines
Tim Peters
tim.one at comcast.net
Mon Oct 13 20:15:19 EDT 2003
[Tony Meyer]
>>> fp total: 1 1
>>> fp %: 0.30 0.30
>>> fn total: 165 162
>>> fn %: 3.76 3.69
>>> unsure t: 528 526
>>> unsure %: 11.17 11.13
>> That's an extraordinarily high unsure %. Do you normally see
>> such a high rate? The FN rate also seems high.
> That's not that far off the normal FN rate I get (usually about 2.5%
> IIRC), although I attribute a reasonable amount of that to my
> imbalance. That is much higher than the unsure rate I usually get (I
> wasn't paying enough attention, or I would have noticed that). I
> reran it with timtest.py instead of timcv.py, with n=5, and with
> balanced data and got:
Never use timtest. It's slow and too hard to interpret (we've been thru
this before, right?).
> filename:   std_octs  std_subjs
> ham:spam:  1320:1320  1320:1320
> fp total: 0 0
> fp %: 0.00 0.00
> fn total: 73 74
> fn %: 5.53 5.61
> unsure t: 91 93
> unsure %: 3.45 3.52
> real cost: $91.20 $92.60
> best cost: $205.00 $208.60
> h mean: 0.92 0.97
> h sdev: 5.54 5.73
> s mean: 78.52 78.61
> s sdev: 37.02 36.97
> mean diff: 77.60 77.64
> k: 1.82 1.82
>
> This is only with a very small set of data, though (tests of 66
> against 66). Would you say that this is too small a dataset to get
> valid results?
Yes.
> The FN rate has risen even higher, although it's (almost) the same with
> both; the unsure rate, though, is back to what it normally is.
The results of timtest are too hard to interpret, and especially in summary
form.
> ... skipping more timtest output ...
>>> real cost: $280.60 $277.20
>>> best cost: $136.20 $134.00
>> Suggests that the cutoffs are far from optimal. Score
>> distribution histograms would reveal more.
> Is this what mkgraph.py does? I've never managed to figure out how
> to use that...
I don't know what mkgraph.py does; looks like it produces input for some
graph-drawing package. Unless the code has rotted due to disuse, the test
drivers automatically produce ASCII-art histograms, controlled by the
nbuckets and show_histograms options. IIRC, you cut-and-paste them out of
the full output.
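[For readers following along: the histograms are just bucketed counts of
message scores. A minimal sketch of the idea, with hypothetical names
(the real implementation lives in the spambayes test drivers and is
configured via nbuckets):

```python
def ascii_histogram(scores, nbuckets=20, width=50):
    """Bucket scores in [0.0, 1.0] and draw one '*' bar per bucket.

    Scores clustered near 0 (ham) and 1 (spam) show up as tall bars at
    the ends; messages piling up in the middle suggest the ham/spam
    cutoffs are far from optimal.
    """
    buckets = [0] * nbuckets
    for s in scores:
        # Clamp so a score of exactly 1.0 lands in the last bucket.
        i = min(int(s * nbuckets), nbuckets - 1)
        buckets[i] += 1
    peak = max(buckets) or 1
    lines = []
    for i, count in enumerate(buckets):
        bar = '*' * (count * width // peak)
        lines.append('%5.2f %4d %s' % (i / nbuckets, count, bar))
    return '\n'.join(lines)
```

Cutting-and-pasting the driver's real histograms out of the full output
gives the same kind of picture, with ham and spam shown separately.]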
>> Are you also running the mixed unigram/bigram scheme?
> Normally, yes; here, no (using a fresh from cvs copy).
Good.
>> One of the points of a cross-validation run is to get several
>> runs, and see how many won, lost, and tied. This very brief
>> summary output hides all that stuff. So, e.g., we can't tell
>> whether all 6 runs had a tiny win, or 5 lost a little and 1
>> won big, adding up to a tiny overall win. The smaller the
>> net effect in the end, the more important to see more details.
> So when using timcv.py I should post the rates.py output rather than
> cmp.py or table.py?
cmp.py produces (unless the code has rotted due to disuse) an account of how
many runs lost, won and tied. table.py is much more telegraphic, and more
useful for getting a quick feel for the relative results across several
alternatives. cmp.py is only concerned with producing "before" and "after"
statistics for a single change, and gives more detail about that change than
table.py produces.
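[The win/loss/tie accounting Tim describes can be sketched like so; a
simplified illustration of the idea, not cmp.py's actual code:

```python
def win_loss_tie(before, after):
    """Compare per-run error rates from two cross-validation runs.

    Lower is better.  Returns (won, lost, tied) counts across runs, so
    a tiny aggregate win can be distinguished from, e.g., 5 runs losing
    a little while 1 run wins big.
    """
    won = lost = tied = 0
    for b, a in zip(before, after):
        if a < b:
            won += 1
        elif a > b:
            lost += 1
        else:
            tied += 1
    return won, lost, tied
```

This is exactly the detail a one-line summary hides: the smaller the net
effect, the more the per-run breakdown matters.]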
> Is the abbreviated version ok for timtest?
Please <wink> don't use timtest.