[Spambayes] training suggestions

Thu Aug 3 22:04:08 CEST 2006

Interesting.

So now I am confused about what the -f option is for. I don't use it because I don't
want to retrain everything, just the messages that have NOT been trained yet. Am I wrong
in assuming this?

Now that I think about it, the best thing would be to do the following:

1. message comes in to the filter
2. filter sorts it as ham or spam
3. db is trained with this message as ham or spam (just this one message)

When the user re-sorts a message (e.g. moves spam from inbox -> spam folder):
4. the changes to the db made in step 3 get undone and the message gets trained as spam
(and similarly as ham if the user moved it from spam folder -> inbox)
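
To make the idea concrete, here is a rough sketch of steps 3 and 4 in Python. It is
only a toy model: ToyTokenDB and its train/untrain/retrain methods are names I made up
for illustration, not the spambayes API, and the "tokens" are just a list of words.
The point is that if the db remembers which label each message id was trained under,
a later move between folders can undo exactly that training and redo it with the
corrected label.

```python
from collections import Counter


class ToyTokenDB:
    """Hypothetical stand-in for a training db; NOT the spambayes API."""

    def __init__(self):
        # Per-label count of how many trained messages contained each token.
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.nmsgs = {"ham": 0, "spam": 0}
        # message id -> label the message is currently trained under
        self.trained_as = {}

    def train(self, msg_id, tokens, label):
        """Step 3: train one message under the filter's own verdict."""
        if msg_id in self.trained_as:
            raise ValueError("already trained: %s" % msg_id)
        self.counts[label].update(set(tokens))
        self.nmsgs[label] += 1
        self.trained_as[msg_id] = label

    def untrain(self, msg_id, tokens):
        """Undo the earlier training of one message."""
        label = self.trained_as.pop(msg_id)
        self.counts[label].subtract(set(tokens))
        # Unary plus drops tokens whose count fell to zero.
        self.counts[label] = +self.counts[label]
        self.nmsgs[label] -= 1

    def retrain(self, msg_id, tokens, label):
        """Step 4: the user moved the message, so correct its label.

        Returns False when the message is already trained under `label`,
        which also avoids training the same message twice.
        """
        if self.trained_as.get(msg_id) == label:
            return False
        if msg_id in self.trained_as:
            self.untrain(msg_id, tokens)
        self.train(msg_id, tokens, label)
        return True


db = ToyTokenDB()
tokens = ["cheap", "pills", "now"]
db.train("msg-1", tokens, "ham")     # the filter guessed ham
db.retrain("msg-1", tokens, "spam")  # the user moved it to the spam folder
```

Note that untraining needs the same tokens that were used for training, so the
message itself (or its token list) has to still be around when the user moves it.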

Can spambayes do this? Can I specify just a message id or a list of message ids? (I use
maildir-format mail storage.)

Thanks,
Dhaval

skip at pobox.com said:

> 
>     Dhaval> I also know that training the same messages twice is not a good
>     Dhaval> thing.  Are there flags which will not train any message which
>     Dhaval> has already been trained?
> 
> Dunno, but in the contrib directory of the CVS repository (does contrib make
> it into distributions?) there is a fuzzy checksum program (pycksum.py) I
> wrote a long time ago based upon a similar tool developed by the
> SpamAssassin folks.  If you pipe your mails through it before training it
> might do a reasonable job of deleting putative duplicates.  If they are true
> duplicates you can do something similar, just replace the guts of the
> generate_checksum() function with something like md5.checksum().
> 
>     Dhaval> If possible, it would be helpful if you show me the flags you
>     Dhaval> use when training initially after a fresh db is made, and the
>     Dhaval> flags you use for ongoing training.
> 
> I never incrementally train.  I use my train-to-exhaustion script (tte.py,
> also in the contrib directory) fronted by a small shell script that sets
> parameters, cleans the mails, etc.  It doesn't sound like you do incremental
> training either.  I don't expect you will be able to use it as-is, but I've
> attached it as something you can use as a starting point for tte.py
> experimentation.
> 
> Skip
> 
> 
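
For the "true duplicates" case mentioned above, something along these lines might
work on a maildir. This is a sketch using only the Python standard library (mailbox
and hashlib), not pycksum.py or any spambayes code; the function name
first_of_each() is my own. It hashes the raw bytes of each message, so it only
catches byte-for-byte identical copies. Copies that differ in any header (a
different Received line, say) will not match and would need the fuzzy approach.

```python
import hashlib
import mailbox


def first_of_each(maildir_path):
    """Return the maildir keys of messages, skipping exact duplicates.

    Hypothetical helper: two messages count as duplicates only when their
    raw bytes are identical. The first key (in sorted order) of each group
    of identical messages is kept.
    """
    box = mailbox.Maildir(maildir_path, create=False)
    seen = set()
    keep = []
    for key in sorted(box.keys()):
        # Any stable digest works here; sha256 is used instead of md5
        # only because md5 is disabled on some systems.
        digest = hashlib.sha256(box.get_bytes(key)).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(key)
    return keep
```

The returned keys could then be used to pick which messages to hand to the trainer,
e.g. with box.get_message(key) for each key in the list.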
