[spambayes-dev] Incremental training results

Thu Jan 8 00:36:31 EST 2004

In message:  <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B at its-xchg4.massey.ac.nz>
             "Tony Meyer" <ta-meyer at ihug.co.nz> writes:

>I finally got around to having a go at the incremental training setup
>today.

Huzzah!

>I *think* I got it working, and I think I kinda understand what the
>results are telling me.
>
>The graphs are here:
>
><http://www.massey.ac.nz/~tameyer/research/spambayes/incremental.html>

Hrm.  You don't have the X axis labeled; what units does it use?
Days (or rather, groups) as I did?  What happens at about 250 that
changes the plot from what looks like an approximation of an inverse
function (with all the data lines overlapping) into a very distinct
set of separate lines?

Can you post the changes to mkgraph.py?

>If someone (Alex?) would like to quickly eyeball them and say whether
>they look like they might be right that would be cool :)

They look a bit bizarre to me, with that dramatic behaviour change at
250.

>I also had a stab at creating a regime, which might possibly be all
>wrong :)

Your regime looks fine to me.

>(All the testing is with the default option settings).

OK.  By default, the incremental harness runs all 10 classifiers at
once (one for the input minus each set).  There's a command line
option to run just one classifier (excluding a specified set), which I
always use (my machine doesn't have the memory to hold all 10
classifiers at once).  I'm guessing that you used the former (default)
behaviour... and it's been long enough since I wrote it that I have no
idea what that would do in conjunction with mkgraph.py.  That might be
what's making the graphs look odd to me.

The 10-day span thing means: 'Take the data (ham count, spam count,
unsure, fp, fn, etc.) for 10 days at once, compute the totals within
that window, and plot the result.  Use a sliding window, so that for
each day you drop the data that is now 10 days old as you add in the
data for the new day.'

Using a 3-day span (to make the equations smaller), if you had the
data:

day:   1  2  3  4  5  6  7  8  9 10 11 12 13 14
ham:   1  3  2  4  2  5  6  2  3  2  3  2  1  5
spam:  8  9  8  3  9  8  9  8  9  8  3  7  8  9
fn:    2  1  2  1  2  1  2  3  2  3  2  1  3  1

Then you'd get the following values plotted (days before day 1 count
as zero):
day 1: fn % = (0 + 0 + 2) / ((0 + 0 + 1) + (0 + 0 + 8)) = 22.2%
day 2: fn % = (0 + 2 + 1) / ((0 + 1 + 3) + (0 + 8 + 9)) = 14.3%
day 3: fn % = (2 + 1 + 2) / ((1 + 3 + 2) + (8 + 9 + 8)) = 16.1%
day 4: fn % = (1 + 2 + 1) / ((3 + 2 + 4) + (9 + 8 + 3)) = 13.8%
etc.
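
The same arithmetic can be sketched in a few lines of Python.  This is
only an illustration of the sliding-window calculation described above,
not the code in mkgraph.py; the function name span_fn_percent and its
span parameter are made up for the example, and days before day 1 are
treated as zero, as in the equations.

```python
def span_fn_percent(ham, spam, fn, span=3):
    """Return the sliding-window fn % for each day.

    Hypothetical helper for illustration only. Days before the first
    day contribute nothing, so early windows are shorter than `span`.
    """
    results = []
    for i in range(len(fn)):
        start = max(0, i - span + 1)          # first day inside the window
        fn_total = sum(fn[start:i + 1])
        msg_total = sum(ham[start:i + 1]) + sum(spam[start:i + 1])
        # Guard against an empty window (no messages at all).
        results.append(100.0 * fn_total / msg_total if msg_total else 0.0)
    return results


# The sample data from the table above.
ham  = [1, 3, 2, 4, 2, 5, 6, 2, 3, 2, 3, 2, 1, 5]
spam = [8, 9, 8, 3, 9, 8, 9, 8, 9, 8, 3, 7, 8, 9]
fn   = [2, 1, 2, 1, 2, 1, 2, 3, 2, 3, 2, 1, 3, 1]

for day, pct in enumerate(span_fn_percent(ham, spam, fn), start=1):
    print("day %d: fn %% = %.1f%%" % (day, pct))
```

With span=3 the first four lines printed match the hand-computed
values for days 1 to 4.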

The span plots give some idea of 'what is the performance at this time,
as the user would experience it', whereas the cumulative plots show, well,
the overall numbers as they mature.
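
For contrast, here is a rough sketch of a cumulative figure on the
same sample data.  It assumes 'cumulative' means the running fn total
divided by the running message total, using the same fn / (ham + spam)
ratio as the span example; that reading, and the function name
cumulative_fn_percent, are my assumptions for illustration rather than
a description of what mkgraph.py does.

```python
from itertools import accumulate


def cumulative_fn_percent(ham, spam, fn):
    """Return the running fn % up to and including each day.

    Hypothetical helper: assumes cumulative fn % is
    (all fn so far) / (all ham + spam so far).
    """
    fn_running = accumulate(fn)
    msg_running = accumulate(h + s for h, s in zip(ham, spam))
    return [100.0 * f / m if m else 0.0
            for f, m in zip(fn_running, msg_running)]


# The same sample data as in the 3-day span example.
ham  = [1, 3, 2, 4, 2, 5, 6, 2, 3, 2, 3, 2, 1, 5]
spam = [8, 9, 8, 3, 9, 8, 9, 8, 9, 8, 3, 7, 8, 9]
fn   = [2, 1, 2, 1, 2, 1, 2, 3, 2, 3, 2, 1, 3, 1]

for day, pct in enumerate(cumulative_fn_percent(ham, spam, fn), start=1):
    print("day %d: cumulative fn %% = %.1f%%" % (day, pct))
```

Under that assumption each new day moves the cumulative figure less
and less, while the span figure keeps reacting to recent days only.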

- Alex