[spambayes-dev] RE: [Spambayes] question regarding training

Thu Aug 12 17:38:18 CEST 2004

[Seth Goodman]
>As you say, automating this is not easy.  There are no folders of
>confirmed ham or spam in the Outlook implementation to choose among.
>Using the dumb and ugly (tm) method, the additional ham or spam the user
>selects to train are manually selected and are actually ham or spam.
>The text box could further suggest that they train on messages that
>scored furthest from perfect classification.

I think there *are* folders whose classification is about 99% confirmed:
messages in the "watch" folders that have been read (or are older than x
days), and messages in the spam folder that have been read (or are older
than x days).

When I get a ham/spam imbalance, and need more hams trained, I do the
same thing. I sort my Outlook inbox by the spam column, and find the
untrained "edge cases" to train on. That is, I find the hams that scored
just under my threshold (20%) and train on them.
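
A minimal sketch of that manual selection, in Python. Nothing here is
the real Spambayes or Outlook API: the Message class, its fields, and
edge_case_hams() are hypothetical names made up for illustration, and
scores are assumed to be percentages with a ham threshold of 20.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical stand-in for a scored Outlook message.
    subject: str
    score: float    # spam score as a percentage, 0-100
    trained: bool   # True if already used for training


def edge_case_hams(messages, ham_cutoff=20.0, limit=10):
    """Return untrained messages scoring just under the ham cutoff.

    The highest-scoring hams come first, since they are the ones
    closest to being misclassified.
    """
    candidates = [m for m in messages
                  if not m.trained and m.score < ham_cutoff]
    candidates.sort(key=lambda m: m.score, reverse=True)
    return candidates[:limit]
```

Sorting in descending order mirrors sorting the inbox by the spam
column and reading down from the threshold.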

It would seem to me that this process could be automated. We have a list
of folders Spambayes is watching, which presumably contain ham.
Spambayes knows where it stores spam, and we know which messages we've
already trained on. The Outlook plug-in could just check the training
balance ratio each time it runs, and if it exceeds some limit (say, 1.5),
it could go out and find more untrained messages to even the load.

Of course, this will not work for people who don't keep around at least
a few hams/spams that were classified correctly. I don't know how to
solve that issue, other than to have the "autobalance" abort with an
error to the user, or simply do nothing.
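
Roughly, the "autobalance" check could look like the sketch below. This
is only an illustration of the idea, not existing Spambayes code: the
function names, the `available` dictionary of untrained candidates, and
the 1.5 limit are all assumptions. When no candidates are found it takes
the "simply do nothing" route and returns an empty list.

```python
import math


def training_ratio(n_ham, n_spam):
    """Ratio of the larger trained count to the smaller (always >= 1.0)."""
    low, high = sorted((n_ham, n_spam))
    if low == 0:
        return float("inf") if high > 0 else 1.0
    return high / low


def messages_needed(n_ham, n_spam, max_ratio=1.5):
    """Return (kind, count) needed to get within max_ratio, or None."""
    if training_ratio(n_ham, n_spam) <= max_ratio:
        return None
    kind = "ham" if n_ham < n_spam else "spam"
    low, high = sorted((n_ham, n_spam))
    # Smallest count for the short side that satisfies the ratio limit.
    target = math.ceil(high / max_ratio)
    return kind, target - low


def autobalance(n_ham, n_spam, available, max_ratio=1.5):
    """Pick untrained messages to train on; an empty list means do nothing.

    `available` maps "ham"/"spam" to lists of untrained candidates
    (hypothetical; e.g. read messages from the watched or spam folders).
    """
    need = messages_needed(n_ham, n_spam, max_ratio)
    if need is None:
        return []
    kind, count = need
    return available.get(kind, [])[:count]
```

With 100 hams and 300 spams trained, messages_needed() asks for 100 more
hams, which would bring the ratio back to 1.5.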

I'd like to try my hand at Python and contribute this enhancement
myself, to see how it works. But I'm not very familiar with the
Spambayes code base. Any idea which modules/classes I should be looking
at for starters?

Regards,
	Ryan