[Spambayes] So many spams, so little ham
Skip Montanaro
skip at pobox.com
Mon Jun 28 09:44:57 EDT 2004
Someone> I have trained 253 ham and 1181 spam.
Tony> Note that, in general, you'll have better results training roughly
Tony> the same amount of ham and spam.
Amir> I'm really trying to, but since nowadays I get much more spam than
Amir> ham (like many people do), I cannot really keep the numbers
Amir> balanced anymore. Is this really a problem?
I've solved this problem in my train-to-exhaustion script, contrib/tte.py.
I don't think this will help people who use Outlook unless they have some way
to save their spam and ham to external mailboxes and retrain from scratch.
tte.py ruthlessly trains hams and spams in pairs, skipping any leftovers at
the end. The -R flag causes it to work its way backward through the
mailboxes. In general, this means that it trains on new messages in
preference to old messages. The -c flag causes it to write out new
mailboxes, culling messages which were considered but scored correctly in
each round. This has the nice effect that you don't need to worry if you
have two of the same sort of ham or spam. The script will automatically
cull those messages which will have no effect on training.
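
For anyone who wants to see the shape of the idea without reading the script, here is a rough Python sketch of the pairing, reverse-order and culling behaviour described above. It is not the code from contrib/tte.py. ToyClassifier, train_to_exhaustion, the cutoff values and max_rounds are all made up for illustration, and messages are just lists of tokens rather than real mailbox entries.

```python
class ToyClassifier:
    """Hypothetical stand-in for a real classifier (NOT the Spambayes API).

    It scores a message by the share of its known-token counts that
    came from spam training.
    """

    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}

    def train(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in set(tokens):
            counts[tok] = counts.get(tok, 0) + 1

    def score(self, tokens):
        spam = sum(self.spam_counts.get(t, 0) for t in set(tokens))
        ham = sum(self.ham_counts.get(t, 0) for t in set(tokens))
        total = spam + ham
        # Unknown messages sit in the middle, i.e. "unsure".
        return 0.5 if total == 0 else spam / total


def train_to_exhaustion(hams, spams, classifier, reverse=False,
                        ham_cutoff=0.2, spam_cutoff=0.8, max_rounds=10):
    """Train in ham/spam pairs until a full round has no misses.

    Returns (kept_hams, kept_spams): the messages that were trained on
    at least once. Messages that scored correctly in every round are
    culled, as are leftovers that never got a partner.
    The cutoffs and max_rounds are illustrative assumptions.
    """
    if reverse:
        # Like -R: start from the end of each mailbox (newest first).
        hams, spams = hams[::-1], spams[::-1]
    pairs = list(zip(hams, spams))  # zip() drops unpaired leftovers
    missed_hams, missed_spams = set(), set()

    for _ in range(max_rounds):
        misses = 0
        for i, (ham, spam) in enumerate(pairs):
            if classifier.score(ham) > ham_cutoff:
                classifier.train(ham, is_spam=False)
                missed_hams.add(i)
                misses += 1
            if classifier.score(spam) < spam_cutoff:
                classifier.train(spam, is_spam=True)
                missed_spams.add(i)
                misses += 1
        if misses == 0:
            break  # everything considered now scores correctly

    kept_hams = [pairs[i][0] for i in sorted(missed_hams)]
    kept_spams = [pairs[i][1] for i in sorted(missed_spams)]
    return kept_hams, kept_spams


if __name__ == "__main__":
    hams = [["meeting", "lunch"], ["meeting", "notes"], ["old", "ham"]]
    spams = [["buy", "pills"], ["buy", "pills", "now"]]
    print(train_to_exhaustion(hams, spams, ToyClassifier()))
```

In the toy data there are three hams and two spams, so one ham is never considered. Without reverse it is the last ham in the list that gets skipped; with reverse=True it is the first (oldest) one, which is the point of working backward.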
Finally, if the generated database gets bigger than I'd like, I visit the
ham and spam collections in my mail program, sort by date and toss out a few
old messages from each collection.
I've attached the shell script (tte.sh) I use to drive tte.py. In the
common case I run it like so. (I suppose I could bury the mv and touch
commands in tte.sh, but I haven't.)
    cd ~/tmp
    # the .cull files are those saved from the previous tte run
    mv newham.old.cull newham.old
    mv newspam.old.cull newspam.old
    # just to guarantee we'll run the main loop
    touch newham
    tte.sh
You won't be able to use tte.sh as-is, but it may be useful as a jumping-off
point.
Skip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 2200 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20040628/5e8fd007/attachment.obj