[spambayes-bugs] [ spambayes-Bugs-1101281 ] imapfilter with mysql on mac has assertion error

SourceForge.net noreply at sourceforge.net
Thu Jan 13 16:17:46 CET 2005


Bugs item #1101281, was opened at 2005-01-12 22:54
Message generated for change (Comment added) made by jscottjfs
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1101281&group_id=61702

Category: imapfilter
Group: 1.0.1
Status: Open
Resolution: None
Priority: 5
Submitted By: jscott (jscottjfs)
Assigned to: Tony Meyer (anadelonbrin)
Summary: imapfilter with mysql on mac has assertion error

Initial Comment:
Using persistent_use_database=False trains OK

Keeping everything else the same, and switching to
mysql leads to the following errors (or similar
assertion errors involving nspam instead) after a
couple minutes of training.

the mysql database has this:

mysql> describe bayes;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| word  | varchar(255) |      | PRI |         |       |
| nspam | int(11)      |      |     | 0       |       |
| nham  | int(11)      |      |     | 0       |       |
+-------+--------------+------+-----+---------+-------+

and 


mysql> select count(word) from bayes;
+-------------+
| count(word) |
+-------------+
|       20125 |
+-------------+


so everything appears to be working well.  Then, somehow, the
training runs amok, crashing imapfilter and giving this:



[dhcp-235-023:~/spambayes-1.0.1/scripts] jscott% python sb_imapfilter.py -c -t -l -5
SpamBayes IMAP Filter Version 0.5 (November 2004)
and engine SpamBayes Engine Version 0.3 (January 2004).

Traceback (most recent call last):
  File "sb_imapfilter.py", line 924, in ?
    run()
  File "sb_imapfilter.py", line 914, in run
    imap_filter.Filter()
  File "sb_imapfilter.py", line 785, in Filter
    self.unsure_folder)
  File "sb_imapfilter.py", line 703, in Filter
    evidence=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 493, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 508, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 308, in probability
    assert hamcount <= nham
AssertionError


----------------------------------------------------------------------

>Comment By: jscott (jscottjfs)
Date: 2005-01-13 15:17

Message:
Logged In: YES 
user_id=1196022

Tried two different training folders with no messages
in common with the first attempt.  Assertion failed on
nspam this time (although it looks like nham was headed 
for a fall, too).  Here are the database numbers followed 
by the traceback.

If I get a chance, I'll try the CVS version of imapfilter.
Also, I'm using the mysql-python 0.9.3 (I think) downloaded
via the MacPython package manager database.  I see they've
moved on to version 1.0 for the new year.

mysql> select count(word) from bayes;
+-------------+
| count(word) |
+-------------+
|       13850 |
+-------------+
1 row in set (0.00 sec)

mysql> select * from bayes where word="saves state";
Empty set (0.00 sec)

mysql> select * from bayes where word="saved state";
+-------------+-------+------+
| word        | nspam | nham |
+-------------+-------+------+
| saved state |   129 |  339 |
+-------------+-------+------+
1 row in set (0.01 sec)

mysql> select * from bayes where nspam > 129;
+---------------------+-------+------+
| word                | nspam | nham |
+---------------------+-------+------+
| header:Message-ID:1 |   130 |  339 |
+---------------------+-------+------+
1 row in set (0.09 sec)

mysql> select * from bayes where nham > 339;
+----------+-------+------+
| word     | nspam | nham |
+----------+-------+------+
| subject: |    99 |  468 |
+----------+-------+------+
1 row in set (0.08 sec)


SpamBayes IMAP Filter Version 0.5 (November 2004)
and engine SpamBayes Engine Version 0.3 (January 2004).

Traceback (most recent call last):
  File "sb_imapfilter.py", line 924, in ?
    run()
  File "sb_imapfilter.py", line 914, in run
    imap_filter.Filter()
  File "sb_imapfilter.py", line 785, in Filter
    self.unsure_folder)
  File "sb_imapfilter.py", line 703, in Filter
    evidence=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 493, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 508, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 311, in probability
    assert spamcount <= nspam
AssertionError

----------------------------------------------------------------------

Comment By: jscott (jscottjfs)
Date: 2005-01-13 12:17

Message:
Logged In: YES 
user_id=1196022

Retraining from scratch isn't an option, since this error
occurs on initial training with the mysql database, every time.
Although I blow away the hammie and message files, and
recreate the spambayes database every time I start a new
attempt, perhaps there is an initialization step for the
mysql database that I don't know about?

It may be a one off.  I'll try another couple of training
folders to see if the corruption could be caused by a
pathological message in my baseline training folders. Still,
hard to see how that would affect mysql, but not the pickle.

mysql> select * from bayes where word="saved state";
+-------------+-------+------+
| word        | nspam | nham |
+-------------+-------+------+
| saved state |   263 |  156 |
+-------------+-------+------+
1 row in set (0.00 sec)

mysql> select * from bayes where nham > 156;
+----------+-------+------+
| word     | nspam | nham |
+----------+-------+------+
| subject: |   208 |  177 |
+----------+-------+------+
1 row in set (0.30 sec)

mysql> select * from bayes where nspam > 263;
Empty set (0.10 sec)

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2005-01-13 00:54

Message:
Logged In: YES 
user_id=552329

(Oops.  I skimmed your message too fast - there aren't any
values for nspam/nham there, just the defaults).  Most of my
earlier comment is still correct, anyway.

Try having a look (select * from bayes where word="saved
state") at the nham/nspam values.  They should be at least
as large as any individual counts.  You can manually correct
them if you like, but it's generally a better idea to
retrain from scratch.
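
One way to run that check across the whole table, rather than one
query at a time, is sketched below.  It is only an illustration: it
uses Python's built-in sqlite3 module as a stand-in for MySQL so that
it runs anywhere, the helper name find_inconsistent_tokens is made up
for this example, and the two rows are copied from the query output
posted in this report.  It assumes the totals live in the
"saved state" row of the bayes table, as those queries suggest.

```python
import sqlite3

STATE_KEY = "saved state"  # row assumed to hold the overall nspam/nham totals


def find_inconsistent_tokens(conn):
    """Return rows whose counts exceed the totals in the 'saved state' row."""
    cur = conn.cursor()
    cur.execute("SELECT nspam, nham FROM bayes WHERE word = ?", (STATE_KEY,))
    row = cur.fetchone()
    if row is None:
        raise LookupError("no %r row found" % STATE_KEY)
    total_spam, total_ham = row
    # Any token seen in more spam (ham) messages than were trained in
    # total is inconsistent with the saved totals.
    cur.execute(
        "SELECT word, nspam, nham FROM bayes "
        "WHERE word <> ? AND (nspam > ? OR nham > ?) ORDER BY word",
        (STATE_KEY, total_spam, total_ham),
    )
    return cur.fetchall()


# In-memory stand-in for the MySQL table, loaded with reported values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes (word VARCHAR(255) PRIMARY KEY, "
    "nspam INTEGER DEFAULT 0, nham INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO bayes VALUES (?, ?, ?)",
    [("saved state", 263, 156), ("subject:", 208, 177)],
)
bad_rows = find_inconsistent_tokens(conn)
print(bad_rows)
```

Against a real MySQL database through MySQLdb, the same queries
should work with %s placeholders in place of ?, since that module
uses the "format" parameter style.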

This might be a one-off problem, but if it does recur then
we can try to figure out what's causing it.  You could also
try using spambayes from CVS, which has a much improved
sb_imapfilter (which will be in the 1.1 release).

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2005-01-12 23:13

Message:
Logged In: YES 
user_id=552329

The problem is that nham (nspam) is meant to be the total
number of ham (spam) messages that you have trained.  It
looks like it's 0 above, which is not good.

Offhand, I'm not sure what would cause this - updating the
nham/nspam values is done at the same time as the token
counts, so if one is wrong, they really ought to both be.
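
To make the failing check concrete, here is a simplified sketch of
the invariant that the assertions in the traceback enforce.  It is
not the actual classifier.py code: the function name
check_token_counts and its arguments are invented for illustration,
and only the two assert conditions are taken from the traceback.

```python
def check_token_counts(spamcount, hamcount, nspam, nham):
    """Mirror the two assertions shown in the traceback.

    spamcount/hamcount: messages of each kind that contained the token.
    nspam/nham: total messages of each kind trained so far.
    """
    # A token cannot have been seen in more ham (spam) messages than
    # the number of ham (spam) messages trained in total.
    assert hamcount <= nham
    assert spamcount <= nspam
    # Fraction of trained spam and ham messages containing the token
    # (the "or 1" guards avoid dividing by zero before any training).
    return spamcount / float(nspam or 1), hamcount / float(nham or 1)


# Counts reported for "subject:" against "saved state" totals of
# nspam=263, nham=156: the ham assertion fails because 177 > 156.
try:
    check_token_counts(spamcount=208, hamcount=177, nspam=263, nham=156)
    consistent = True
except AssertionError:
    consistent = False
print(consistent)
```

So a single token row that gets ahead of the stored totals is enough
to trip the assertion during classification.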

I'll try to find time to replicate this here later today and
update with what happens.

----------------------------------------------------------------------
