[Spambayes] training WAS: aging information

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Tue Feb 18 12:48:07 EST 2003


2/18/2003 10:37:51 AM, François Granger <francois.granger at free.fr> wrote:

>on 18/02/03 16:12, D. R. Evans at N7DR at arrisi.com wrote:
>
>> I run in pop3proxy mode. The web page in that mode says that spambayes
>> stores all my incoming mail. Presumably this means "we store it until
>> you train on it" rather than "we store it for all time". I hope.
>
>As far as I remember, it keeps the last 7 days....

This is true.  If you pay no attention, stuff goes away after 7 days.

>
>> In any case, I'm trying to figure out whether it's possible to save
>> myself the increasingly-annoying chore of going to the web interface

An idea that we toyed with, and even built a prototype of, was to include an 
smtpproxy in the mix.  With that, you could train by forwarding a mail to 
spam@ or ham@.  This was very convenient, and eliminated much of the 
'increasingly-annoying chore' you refer to (which, incidentally, is part and 
parcel of Bayesian (machine learning) algorithms).  The problem with using an 
smtpproxy is that most mailers mess around with the headers when forwarding. 
Some of them even lop almost all of them off.  There are many important clues 
in the headers, and these clues are simply lost by this mechanism.  So we 
chose to cache incoming mail and provide a user interface, so training could 
be done on the intact mail.
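
To make the forwarding idea concrete, here is a minimal sketch of how a proxy 
might pick the training label from the recipient address.  This is not the 
actual prototype: label_from_recipient, TRAINING_LABELS and the example 
addresses are names assumed purely for illustration, and only the Python 
standard library is used.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical mapping from the local part of the recipient address
# (the bit before the "@") to a training label.
TRAINING_LABELS = {"spam": "spam", "ham": "ham"}


def label_from_recipient(raw_message):
    """Return 'spam', 'ham', or None based on the To: header of a forwarded mail."""
    msg = message_from_string(raw_message)
    for _name, addr in getaddresses(msg.get_all("To", [])):
        local_part = addr.split("@", 1)[0].lower()
        if local_part in TRAINING_LABELS:
            return TRAINING_LABELS[local_part]
    # Not addressed to a training mailbox: treat as ordinary outgoing mail.
    return None
```

Note that raw_message here is whatever the user's mailer chose to forward, so 
the original headers may already be altered or gone by the time the proxy 
sees it.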

But you bring up an interesting point, in that it's very possible that having 
to train will be viewed as an annoying chore by many people.  The smtpproxy 
might provide a much more convenient training mechanism.  We've also toyed 
with the idea of providing pretrained databases, so people don't have to start 
training from scratch.  Of course, the problem with this idea is that one 
man's spam is another man's subscription.  I feel, though, that we *could* 
come up with a few trained databases that would fit some reasonable 
definitions, like "no hardcore porn" for example.  For some people, Victoria's 
Secret would be included in that definition, for others it wouldn't.  But 
almost everyone agrees on the definition of hardcore porn at some level, and 
we may very well be able to provide such a database.

So, Doc, can you give us some feedback on these two ideas?  - TimS

>> and training spambayes at least once per day. Each time I do that, I
>> have to wade through a sea of subject lines, trying to figure out which
>> ones might have been misclassified.
>
>I usually click on the discard link in the head of the ham part. Then I look
>only at the spams and the unsure to check their classification and train on
>them. This is really quick.

This is what I do as well, except that when I get a false negative (a spam 
that landed in ham), I immediately go to the pop3proxy UI and train that one 
as spam.  I will stop doing even this when I'm satisfied with my fp/fn rate, 
and will then train only on mistakes, and occasionally on correctly 
classified stuff to be sure things don't get out of whack.  - TimS
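
A rough sketch of that "train on mistakes, plus the occasional correct one" 
policy follows.  The function name select_for_training, the tuple layout and 
the keep_every parameter are all assumptions made for illustration; this is 
not code from spambayes itself.

```python
def select_for_training(reviewed, keep_every=0):
    """Pick messages to train on from (msg_id, predicted, actual) tuples.

    Mistakes (predicted != actual, which also covers 'unsure') are always
    selected.  If keep_every > 0, every keep_every-th correctly classified
    message is selected as well, as an occasional sanity check.
    Returns a list of (msg_id, actual_label) pairs in review order.
    """
    selected = []
    correct_seen = 0
    for msg_id, predicted, actual in reviewed:
        if predicted != actual:
            selected.append((msg_id, actual))
        else:
            correct_seen += 1
            if keep_every and correct_seen % keep_every == 0:
                selected.append((msg_id, actual))
    return selected
```

With keep_every left at 0 only the misclassified and unsure messages are 
returned, which matches the "only train on mistakes" end state.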

>
>-- 
>Mail is a means of communication. People should ask themselves
>about the political implications of their choices (or non-choices)
>of tools and technologies. For clean mail:
><http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes at python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org




