[Spambayes] RE: Help! Imapfilter and mysql/pickle woes

Tony Meyer tameyer at ihug.co.nz
Wed Jan 19 01:25:44 CET 2005


> 1) Using a pickle dbm with sb_imapfilter.py is regularly 
> resulting in a corrupt database within days of wiping it out and starting 
> over. I can get about a week out of the database before it corrupts
> and fails with an assertion error.

This is 1.0.1 sb_imapfilter, yes?  It would be worth giving CVS
sb_imapfilter a go - it should be vastly improved.  I've tried to copy most
bugfixes over to the 1.0.x branch, but that's not been possible when there
are large changes.

I also heard today that using Python 2.4 helps, which I suspect means there
is a problem handling malformed messages.  If using Python 2.4 is easy to
do, then it would be worth doing.

> 2) I've been trying to get the mysql option to work for 
> sb_imapfilter.py on and off for a couple months, but I am still stuck:
>
> First off, regardless of what iteration I try, I cannot seem 
> to specify any DSN other than the default. When I try to specify
> a custom DSN, something happens in the code when it parses the
> values so that the user field is blank, so that the result is
> user '@localhost' tries to log onto mysql without success.

I believe this is caused by a known bug.  It's fixed in CVS for 1.1, but
hasn't been backported.  If you like I can do so, so that the fix is in
1.0.2.  I believe you can work around it by putting a space at the start of
the DSN.

> Upon giving credentials to the default DSN used by the
> script, I can actually get sb_imapfilter.py to train on a 
> sample of spam and ham successfully, but immediately afterwards,
> when I try to actually run sb_imapfilter.py to filter my inbox, it
> fails with the dreaded "Token seen in more spam than spam trained."
> assertion error:

If you have the patience, try doing this:

  0. Clear the ham & spam training folders.
  1. Put one (more) message in each of the ham and spam training folders.
  2. Run sb_imapfilter.py -t.
  3. Do a 'select * from spambayes where word="saved_state"' query against
     the database, and check that the values are the same as the number of
     messages in the folders (i.e. 1,1, then 2,2, then 3,3, ...).
  4. Repeat from 1.

It would help to know if it dies out quickly (like with a single message) or
not.  If you get to high numbers and it's still working, try adding multiple
messages at a time, and see if the counts still match.
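
If it helps, the check in step 3 could be scripted rather than done by hand.
What follows is only a rough sketch under stated assumptions, not SpamBayes
code: the column names nspam and nham are assumed, the helper names are made
up for illustration, and an in-memory SQLite database stands in for MySQL so
that the example runs anywhere.  Against the real database you would connect
with a MySQL DB-API driver instead and use that driver's placeholder style.

```python
import sqlite3


def saved_state_counts(conn, table="spambayes"):
    """Return the (nspam, nham) row stored under 'saved_state', or None.

    The table name comes from the query in step 3; the column names
    nspam and nham are assumptions about the schema.
    """
    # A table name cannot be passed as a bound parameter, so it is
    # interpolated; the word itself is passed as a parameter.
    cur = conn.execute(
        f"SELECT nspam, nham FROM {table} WHERE word = ?", ("saved_state",)
    )
    return cur.fetchone()


def counts_match(conn, spam_in_folder, ham_in_folder):
    """True if the stored counts equal the number of messages in the folders."""
    row = saved_state_counts(conn)
    return row is not None and tuple(row) == (spam_in_folder, ham_in_folder)


# Stand-in database: pretend two rounds of training have been recorded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spambayes (word TEXT PRIMARY KEY, nspam INTEGER, nham INTEGER)"
)
conn.execute("INSERT INTO spambayes VALUES ('saved_state', 2, 2)")

print(counts_match(conn, 2, 2))  # counts agree with two messages per folder
print(counts_match(conn, 3, 2))  # a third spam message was not recorded
```

Running the comparison after every sb_imapfilter.py -t pass would show the
first round at which the stored counts and the folder counts disagree.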

I assume that sb_imapfilter always finishes without error, and isn't
interrupted while training?

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
