[Spambayes] training WAS: aging information

Tue Feb 18 20:31:53 EST 2003

2/18/2003 7:56:38 PM, "D. R. Evans" <N7DR at arrisi.com> wrote:

>On 18 Feb 2003 at 12:48, Tim Stone - Four Stones Expressions wrote:
>
>> >As far as I remember, it keeps the last 7 days....
>>
>> This is true.  If you pay no attention, stuff goes away after 7 days.
>>
>
>That's definitely worth knowing. Thanks.
>
>> >
>> >> In any case, I'm trying to figure out whether it's possible to save
>> >> myself the increasingly-annoying chore of going to the web interface
>>
>> An idea that we toyed with, and even made a prototype implementation,
>> was to include an smtpproxy in the mix.  With that, you could train by
>> forwarding a mail to spam@ or ham@.  This was very convenient, and
>> eliminated much of the 'increasingly-annoying chore' you refer to (which
>> incidentally is part and parcel of Bayesian (machine learning)
>> algorithms).  The problem with using an smtpproxy is that most mailers
>> mess around with the headers. Some of them even lop almost all of them
>> off.  There are many important clues in the headers, and these clues are
>> simply missed by this mechanism.  So we chose to cache incoming mail and
>> give a user interface, so training could be done on the intact mail.
>>
>> But you bring up an interesting point, in that it's very possible that
>> having to train will be viewed as an annoying chore by many people.  The
>
>What was really concerning me was that I had seen no indication that it
>was permissible simply to stop training -- and that if I did so, the
>system wouldn't just store incoming e-mails forever.
>
>So the first stop-gap solution is simple: somewhere state clearly that
>once the filter is working to a user's satisfaction, the user can stop
>training.
>
>Then the problem will be what to do when a spam gets through (or a ham
>doesn't). Obviously (if anything is truly obvious) the user will want
>to train on that one particular mail. The current interface would make
>this a nightmare -- there I am, sitting with 7 days' worth of e-mail
>(which in my case would be something like 1500 messages) and I want to
>find the one that has been misclassified.

Very good point.  I hear ya, and I'll start trying to figure out a way to
accomplish this... it'll take a while, though, 'cause our pop3proxy guy,
Richie, is out of circulation for now...
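
To make the problem concrete, here is a rough sketch of a message cache
with the 7-day expiry mentioned earlier in the thread, plus a direct lookup
so that one misclassified mail can be pulled out without paging through
everything.  The class, its method names, and keying on Message-ID are all
hypothetical; this is not how pop3proxy actually stores mail.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # retention window in seconds, per the thread


class MessageCache:
    """Hypothetical cache of recent mail, keyed by Message-ID."""

    def __init__(self, max_age=SEVEN_DAYS):
        self.max_age = max_age
        self._messages = {}  # message_id -> (arrival_time, raw_text)

    def add(self, message_id, raw_text, now=None):
        """Store the intact message, headers included."""
        now = time.time() if now is None else now
        self._messages[message_id] = (now, raw_text)

    def expire(self, now=None):
        """Drop everything older than max_age; return how many were dropped."""
        now = time.time() if now is None else now
        stale = [mid for mid, (arrived, _) in self._messages.items()
                 if now - arrived > self.max_age]
        for mid in stale:
            del self._messages[mid]
        return len(stale)

    def find(self, message_id):
        """Return the raw text of one cached message, or None if absent."""
        entry = self._messages.get(message_id)
        return entry[1] if entry else None
```

With something like this, training on a single mistake would be a matter
of calling find() with the Message-ID of the offending mail and feeding the
result to the trainer, assuming the user's mail client can show that header.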

>
>So it seems to me that there has to be something like the smtpproxy
>thing. But then I'm biased: my MUA doesn't delete headers. (Actually, I
>was unaware that any mailers did that sort of thing; but I readily
>admit that I'm a naïve rustic.)
>
>> smtpproxy might provide a much more convenient training mechanism.
>> We've also toyed with the idea of providing pretrained databases, so
>> people don't have to start training from scratch.  Of course, the
>
>I don't really like that idea very much. I'm trying to come up with a
>logical explanation for that feeling, though, and not doing very well.
>This is the best I can do:
>
>I am impressed at how quickly spambayes has moved toward near 100%
>accuracy on my system. (So far today it has classified a single spam as
>unsure; everything else has been classified correctly.) If I had
>started from a pre-seeded database, it isn't at all clear that it could
>have converged to my idea of spam as quickly as starting from an empty
>database. Obviously, the experiment could be done to see if it really
>is worth it, but I suspect that all of us have better things to do than
>to grab a ton of spam and build some filters. Maybe I'm wrong. I
>frequently am :-)

The thing I'm concerned about is that we really have only tapped people who
are very savvy, and that the system will ultimately still be too difficult to
comprehend for the 'average joe' user, who stresses out when installing the
latest release of solitaire.  (I have much experience with this syndrome.)
This is *definitely* the case with the current state of the system.  The vast
majority of people pretty much expect to run setup.exe and have it just
miraculously work.  - TimS

>
>> stop even doing this when I'm satisfied with my fp/fn rate, and will
>> then only train on mistakes, and occasionally on correctly classified
>> stuff to be sure things don't get out of whack.  - TimS
>>
>
>I saw a comment in the LJ article that one should train on roughly
>equal numbers of spam and ham. Is this actually true? (This question of
>course merely demonstrates that I'm too lazy to do the maths myself.)

You should shoot for a relative balance, but our research seems to indicate 
that the system isn't particularly sensitive to anything but extreme 
imbalance.  Tim Peters can fill us in a bit more on this one, if he's 
watching.  Tim?  Tim?  Where are you?  - TimS
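
One plausible reason for that tolerance, sketched below: if per-token
counts are divided by the number of spam and ham messages trained before
being compared, a lopsided training set does not by itself skew a token's
score.  This is a sketch in the style of Gary Robinson's adjustment, not a
copy of the spambayes source; the function name and the default values for
strength and unknown_prob are assumptions.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham,
                    strength=0.45, unknown_prob=0.5):
    """Estimate how spammy one token is.

    spam_count / ham_count: messages of each kind containing the token.
    nspam / nham: total messages of each kind trained on.
    strength, unknown_prob: assumed prior weight and prior probability.
    """
    # Normalize by corpus size so 10-of-100 spam and 10-of-1000 ham
    # are compared as rates (0.10 vs 0.01), not as raw counts.
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    total = spam_ratio + ham_ratio
    if total == 0:
        return unknown_prob  # never-seen token: fall back to the prior
    raw = spam_ratio / total
    # Pull the estimate toward the prior when evidence is thin.
    n = spam_count + ham_count
    return (strength * unknown_prob + n * raw) / (strength + n)
```

For example, a token seen in 10 of 100 spams and 10 of 1000 hams still
comes out strongly spammy even though the raw counts are equal, because the
rates differ by a factor of ten.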

>
>One thing I've learned by doing the training is that approximately 10%
>of my mail is spam. I'm surprised, because I would have guessed that
>the proportion was lower than that. I guess that I had got to the point
>where I mentally just filtered it out of consciousness as I clicked the
>"delete" button every morning on the night's accumulation of the stuff.
>
>I really am going to have to try to find time to do the aging thing,
>though. I want to experiment with classifying off-thread postings to
>reflectors as spam :-) I suspect that it won't work very well, but the
>experiment seems like it's worth a try.

Please do!

>
>  Doc
>--------------------------------------------------------------
>Phone:  +1 303 494 0394
>Mobile: +1 720 839 8462
>Fax:    +1 781 240 0527
>--------------------------------------------------------------

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org