[spambayes-dev] spammy subject lines

Mon Oct 13 20:15:19 EDT 2003

[Tony Meyer]
>>> fp total:        1       1
>>> fp %:         0.30    0.30
>>> fn total:      165     162
>>> fn %:         3.76    3.69
>>> unsure t:      528     526
>>> unsure %:    11.17   11.13

>> That's an extraordinarily high unsure %.  Do you normally see
>> such a high rate?  The FN rate also seems high.

> That's not that far off the normal FN rate I get (usually about 2.5%,
> IIRC), although I attribute a reasonable amount of that to my
> imbalance.  That is much higher than the unsure rate I usually get (I
> wasn't paying enough attention, or I would have noticed that).  I
> reran it with timtest.py instead of timcv.py, with n=5, and with
> balanced data and got:

Never use timtest.  It's slow and too hard to interpret (we've been thru
this before, right?).

> filename:  std_octs
>                    std_subjs
> ham:spam:  1320:1320
>                    1320:1320
> fp total:        0       0
> fp %:         0.00    0.00
> fn total:       73      74
> fn %:         5.53    5.61
> unsure t:       91      93
> unsure %:     3.45    3.52
> real cost:  $91.20  $92.60
> best cost: $205.00 $208.60
> h mean:       0.92    0.97
> h sdev:       5.54    5.73
> s mean:      78.52   78.61
> s sdev:      37.02   36.97
> mean diff:   77.60   77.64
> k:            1.82    1.82
>
> This is only with a very small set of data, though (tests of 66
> against 66). Would you say that this is too small a dataset to get
> valid results?

Yes.

> The FN rate has risen even higher, although it's (almost) the same with
> both; unsures are back to what they normally are.

The results of timtest are too hard to interpret, especially in summary
form.

> ... skipping more timtest output ...

>>> real cost: $280.60 $277.20
>>> best cost: $136.20 $134.00

>> Suggests that the cutoffs are far from optimal.  Score
>> distribution histograms would reveal more.
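
A minimal sketch of the "real cost" versus "best cost" comparison. The
weights below ($10 per fp, $1 per fn, $0.20 per unsure) are an assumption,
chosen because they reproduce the costs quoted in this thread (e.g.
73 fn + 91 unsure gives $91.20). The function names and the brute-force
search are hypothetical and are not the test driver's own code; scores are
assumed to be on a 0 to 100 scale.

```python
# Assumed weights; they match the costs quoted in this thread.
FP_WEIGHT = 10.00      # dollars per false positive
FN_WEIGHT = 1.00       # dollars per false negative
UNSURE_WEIGHT = 0.20   # dollars per unsure


def cost(fp, fn, unsure):
    """Total cost in dollars for the given error counts."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


def counts(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Count fp, fn and unsure for scores in [0, 100].

    Assumed rule: score < ham_cutoff is ham, score >= spam_cutoff is
    spam, anything in between is unsure.
    """
    fp = sum(1 for s in ham_scores if s >= spam_cutoff)
    fn = sum(1 for s in spam_scores if s < ham_cutoff)
    unsure = sum(1 for s in list(ham_scores) + list(spam_scores)
                 if ham_cutoff <= s < spam_cutoff)
    return fp, fn, unsure


def best_cutoffs(ham_scores, spam_scores, nbuckets=200):
    """Try every pair of bucket boundaries; return (cost, ham_cutoff, spam_cutoff)."""
    best = None
    for lo in range(nbuckets + 1):
        for hi in range(lo, nbuckets + 1):
            ham_cutoff = 100.0 * lo / nbuckets
            spam_cutoff = 100.0 * hi / nbuckets
            c = cost(*counts(ham_scores, spam_scores, ham_cutoff, spam_cutoff))
            if best is None or c < best[0]:
                best = (c, ham_cutoff, spam_cutoff)
    return best
```

A real cost far above the best cost, as in the figures quoted above, means
some other pair of cutoffs would have been much cheaper on the same scores.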

> Is this what mkgraph.py does?  I've never managed to figure out how
> to use that...

I don't know what mkgraph.py does; looks like it produces input for some
graph-drawing package.  Unless the code has rotted due to disuse, the test
drivers automatically produce ASCII-art histograms, controlled by the
nbuckets and show_histograms options.  IIRC, you cut-and-paste them out of
the full output.
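
For readers who haven't seen one, here is a rough sketch of what an
ASCII-art score histogram amounts to. It is not the test drivers' code:
the function name, the output layout, and the `width` parameter are
made up for illustration, and only the idea of an `nbuckets` setting comes
from the text above. Scores are assumed to run from 0 to 100.

```python
def ascii_histogram(scores, nbuckets=20, width=40):
    """Return the lines of an ASCII-art histogram for scores in [0, 100]."""
    buckets = [0] * nbuckets
    for score in scores:
        # Clamp so a score of exactly 100 lands in the last bucket.
        i = min(int(score * nbuckets / 100.0), nbuckets - 1)
        buckets[i] += 1
    biggest = max(buckets) or 1  # avoid dividing by zero on empty input
    lines = []
    for i, n in enumerate(buckets):
        lower_edge = 100.0 * i / nbuckets
        # Round up so that any non-empty bucket shows at least one star.
        stars = "*" * ((n * width + biggest - 1) // biggest)
        lines.append("%5.1f %5d %s" % (lower_edge, n, stars))
    return lines


if __name__ == "__main__":
    # Hypothetical spam scores, not data from this thread.
    for line in ascii_histogram([99, 100, 97, 55, 12, 98, 100], nbuckets=10):
        print(line)
```

Printing one such histogram for ham and one for spam shows where the
scores pile up relative to the cutoffs, which a table of means and
standard deviations cannot show.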

>> Are you also running the mixed unigram/bigram scheme?

> Normally, yes; here, no (using a fresh from cvs copy).

Good.

>> One of the points of a cross-validation run is to get several
>> runs, and see how many won, lost, and tied.  This very brief
>> summary output hides all that stuff.  So, e.g., we can't tell
>> whether all 6 runs had a tiny win, or 5 lost a little and 1
>> won big, adding up to a tiny overall win.  The smaller the
>> net effect in the end, the more important to see more details.

> So when using timcv.py I should post the rates.py output rather than
> cmp.py or table.py?

cmp.py produces (unless the code has rotted due to disuse) an account of how
many runs lost, won and tied.  table.py is much more telegraphic, and more
useful for getting a quick feel for the relative results across several
alternatives.  cmp.py is only concerned with producing "before" and "after"
statistics for a single change, and gives more detail about that change than
table.py produces.
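
The won/lost/tied bookkeeping is simple enough to sketch. This is only an
illustration of the idea, not cmp.py itself; the function name and the
per-run numbers are hypothetical, and a lower rate is assumed to be better.

```python
def tally(before, after):
    """Count runs where 'after' is lower (won), higher (lost), or equal (tied)."""
    if len(before) != len(after):
        raise ValueError("need the same number of runs before and after")
    won = sum(1 for b, a in zip(before, after) if a < b)
    lost = sum(1 for b, a in zip(before, after) if a > b)
    tied = len(before) - won - lost
    return won, lost, tied


if __name__ == "__main__":
    # Hypothetical per-run fn rates (%), not data from this thread.
    before = [3.9, 3.7, 3.8, 3.6, 3.9, 3.7]
    after = [4.0, 3.8, 3.9, 3.7, 4.0, 2.6]
    print("won %d, lost %d, tied %d" % tally(before, after))
    print("mean before %.2f, mean after %.2f"
          % (sum(before) / len(before), sum(after) / len(after)))
```

In this made-up example the mean improves even though five of the six runs
got slightly worse, which is exactly the kind of thing a brief summary
hides.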

> Is the abbreviated version ok for timtest?

Please <wink> don't use timtest.