[Spambayes] Outlook plugin - training

Tim Peters tim.one@comcast.net
Wed Nov 6 19:36:04 2002


[Moore, Paul]
> ...
> I can do this by regular retraining, but that has 2 disadvantages:
> it's much clumsier than simply clicking on a "clever boy!" button, and
> it relies on me not deleting messages until I do a training run. Much
> of the ham I get is "read and forget", so I'd rather delete
> immediately.
>
> When I get a chance to dive into the code, I'll see how hard this
> would be to implement.

Automatic training needs lots of work.  The Outlook client has gotten
smarter than anything else about this so far, but at the moment it's
basically automating "mistake based" training, which I think will prove to
be a Bad Idea over time.

The ideal is to train regularly on a random sample of all msgs, whether or
not they were correctly classified (I fake this by hand for now).  That
presents some UI and algorithmic challenges.
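
To make the contrast concrete, here is a minimal sketch of the two selection
policies.  It is not the Outlook client's code: the function names, the
(msg_id, predicted, actual) tuple layout and the `rate` parameter are all
assumptions made up for illustration.

```python
import random

# Hypothetical record layout: (msg_id, predicted_label, actual_label).

def mistake_based_sample(messages):
    """Select only the messages the classifier got wrong."""
    return [m for m in messages if m[1] != m[2]]

def random_sample(messages, rate, rng=None):
    """Select each message independently with probability `rate`,
    whether or not it was classified correctly."""
    if rng is None:
        rng = random.Random()
    return [m for m in messages if rng.random() < rate]

msgs = [
    ("a", "ham", "ham"),
    ("b", "spam", "ham"),    # a false positive
    ("c", "spam", "spam"),
]
to_train_on_mistakes = mistake_based_sample(msgs)  # only message "b"
to_train_on_random = random_sample(msgs, rate=0.5)  # varies per run
```

The random policy never looks at the predicted label, so correctly
classified msgs get trained on too.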

It will also create a database size problem:  without a strategy for pruning
useless words, the database will grow without bound (the intuition that "all
words" will have been seen once the database reaches some non-fantastic size
is wrong for computer-based indexing apps, and especially for email -- unique
words keep appearing and keep bloating the beast).  There's been no research
done here yet on how to prune a database over time without damaging accuracy.
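
Purely to show what a pruning pass might look like, here is one possible
rule: drop words that are both rarely seen and not seen recently.  This is a
sketch under assumptions, not a tested strategy, and its effect on accuracy
is unknown.  The dict layout (word -> (spam_count, ham_count, last_seen)) and
the `max_age` and `min_count` parameters are hypothetical, not the Spambayes
database format.

```python
def prune_stale_tokens(db, now, max_age, min_count=2):
    """Return a copy of `db` without words that are both rare and stale.

    db maps word -> (spam_count, ham_count, last_seen); `now`, `last_seen`
    and `max_age` share one time unit (days in the example below).
    """
    kept = {}
    for word, (spam, ham, last_seen) in db.items():
        rare = (spam + ham) < min_count
        stale = (now - last_seen) > max_age
        if not (rare and stale):
            kept[word] = (spam, ham, last_seen)
    return kept

db = {
    "viagra": (40, 0, 100),    # common: kept
    "xqzzy123": (1, 0, 10),    # seen once, long ago: dropped
    "hello": (0, 1, 95),       # seen once, but recently: kept
}
pruned = prune_stale_tokens(db, now=100, max_age=30)
```

Requiring both conditions is a deliberately cautious choice: a word seen
only once is left alone until it has also gone unseen for `max_age`.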



