[spambayes-bugs] [ spambayes-Bugs-1101281 ] imapfilter with mysql on mac has assertion error
SourceForge.net
noreply at sourceforge.net
Thu Jan 13 16:17:46 CET 2005
Bugs item #1101281, was opened at 2005-01-12 22:54
Message generated for change (Comment added) made by jscottjfs
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1101281&group_id=61702
Category: imapfilter
Group: 1.0.1
Status: Open
Resolution: None
Priority: 5
Submitted By: jscott (jscottjfs)
Assigned to: Tony Meyer (anadelonbrin)
Summary: imapfilter with mysql on mac has assertion error
Initial Comment:
Using persistent_use_database=False trains OK
Keeping everything else the same, and switching to
mysql leads to the following errors (or similar
assertion errors involving nspam instead) after a
couple minutes of training.
the mysql database has this:
mysql> describe bayes;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| word  | varchar(255) |      | PRI |         |       |
| nspam | int(11)      |      |     | 0       |       |
| nham  | int(11)      |      |     | 0       |       |
+-------+--------------+------+-----+---------+-------+
and
mysql> select count(word) from bayes;
+-------------+
| count(word) |
+-------------+
|       20125 |
+-------------+
so everything is working well. Then, somehow, the training
runs amok, crashing imapfilter and giving this:
[dhcp-235-023:~/spambayes-1.0.1/scripts] jscott% python sb_imapfilter.py -c -t -l -5
SpamBayes IMAP Filter Version 0.5 (November 2004)
and engine SpamBayes Engine Version 0.3 (January 2004).
Traceback (most recent call last):
  File "sb_imapfilter.py", line 924, in ?
    run()
  File "sb_imapfilter.py", line 914, in run
    imap_filter.Filter()
  File "sb_imapfilter.py", line 785, in Filter
    self.unsure_folder)
  File "sb_imapfilter.py", line 703, in Filter
    evidence=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 493, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 508, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 308, in probability
    assert hamcount <= nham
AssertionError
----------------------------------------------------------------------
>Comment By: jscott (jscottjfs)
Date: 2005-01-13 15:17
Message:
Logged In: YES
user_id=1196022
Tried two different training folders with no messages
in common with the first attempt. Assertion failed on
nspam this time (although it looks like nham was headed
for a fall, too). Here are the database numbers followed
by the traceback.
If I get a chance, I'll try the CVS version of imapfilter.
Also, I'm using mysql-python 0.9.3 (I think), downloaded
via the MacPython package manager database. I see they've
moved on to version 1.0 for the new year.
mysql> select count(word) from bayes;
+-------------+
| count(word) |
+-------------+
|       13850 |
+-------------+
1 row in set (0.00 sec)
mysql> select * from bayes where word="saves state";
Empty set (0.00 sec)
mysql> select * from bayes where word="saved state";
+-------------+-------+------+
| word        | nspam | nham |
+-------------+-------+------+
| saved state |   129 |  339 |
+-------------+-------+------+
1 row in set (0.01 sec)
mysql> select * from bayes where nspam > 129;
+---------------------+-------+------+
| word                | nspam | nham |
+---------------------+-------+------+
| header:Message-ID:1 |   130 |  339 |
+---------------------+-------+------+
1 row in set (0.09 sec)
mysql> select * from bayes where nham > 339;
+----------+-------+------+
| word     | nspam | nham |
+----------+-------+------+
| subject: |    99 |  468 |
+----------+-------+------+
1 row in set (0.08 sec)
SpamBayes IMAP Filter Version 0.5 (November 2004)
and engine SpamBayes Engine Version 0.3 (January 2004).
Traceback (most recent call last):
  File "sb_imapfilter.py", line 924, in ?
    run()
  File "sb_imapfilter.py", line 914, in run
    imap_filter.Filter()
  File "sb_imapfilter.py", line 785, in Filter
    self.unsure_folder)
  File "sb_imapfilter.py", line 703, in Filter
    evidence=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 493, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 508, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 311, in probability
    assert spamcount <= nspam
AssertionError
----------------------------------------------------------------------
Comment By: jscott (jscottjfs)
Date: 2005-01-13 12:17
Message:
Logged In: YES
user_id=1196022
Retraining from scratch isn't an option, since this error
occurs on initial training with the mysql database, every time.
Although I blow away the hammie and message files, and recreate
the spambayes database every time I start a new attempt, perhaps
there is an initialization step for the mysql database that
I don't know about?
It may be a one-off. I'll try another couple of training
folders to see if the corruption could be caused by a
pathological message in my baseline training folders. Still,
hard to see how that would affect mysql, but not the pickle.
mysql> select * from bayes where word="saved state";
+-------------+-------+------+
| word        | nspam | nham |
+-------------+-------+------+
| saved state |   263 |  156 |
+-------------+-------+------+
1 row in set (0.00 sec)
mysql> select * from bayes where nham > 156;
+----------+-------+------+
| word     | nspam | nham |
+----------+-------+------+
| subject: |   208 |  177 |
+----------+-------+------+
1 row in set (0.30 sec)
mysql> select * from bayes where nspam > 263;
Empty set (0.10 sec)
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2005-01-13 00:54
Message:
Logged In: YES
user_id=552329
(Oops. I skimmed your message too fast - there aren't any
values for nspam/nham there, just the defaults). Most of my
earlier comment is still correct, anyway.
Try having a look (select * from bayes where word="saved
state") at the nham/nspam values. They should be at least
as large as any individual counts. You can manually correct
them if you like, but it's generally a better idea to
retrain from scratch.
This might be a one-off problem, but if it does reoccur then
we can try and figure out what's causing the problem. You
could also try using spambayes from CVS, which has a much
improved sb_imapfilter (which will be in the 1.1 release).
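
The check described above (the totals in the "saved state" row
should be at least as large as any individual token count) can be
sketched in Python. This is only an illustration, not SpamBayes
code: find_inconsistent_tokens is a hypothetical helper, and the
rows are copied from the mysql output in the 2005-01-13 15:17
comment on this bug.

```python
# Rows copied from the mysql session in this report: (word, nspam, nham).
ROWS = [
    ("saved state", 129, 339),
    ("header:Message-ID:1", 130, 339),
    ("subject:", 99, 468),
]

# The row that holds the total number of trained spam/ham messages.
STATE_KEY = "saved state"


def find_inconsistent_tokens(rows, state_key=STATE_KEY):
    """Return the token rows whose counts exceed the trained-message totals.

    Hypothetical helper for illustration; it is not part of SpamBayes.
    """
    totals = {word: (nspam, nham) for word, nspam, nham in rows}
    total_spam, total_ham = totals[state_key]
    bad = []
    for word, spamcount, hamcount in rows:
        if word == state_key:
            continue
        # These are the conditions the failing assertions test:
        # spamcount <= nspam and hamcount <= nham.
        if spamcount > total_spam or hamcount > total_ham:
            bad.append((word, spamcount, hamcount))
    return bad


if __name__ == "__main__":
    for word, spamcount, hamcount in find_inconsistent_tokens(ROWS):
        print(word, spamcount, hamcount)
```

With those rows, both "header:Message-ID:1" (130 spam against a
total of 129) and "subject:" (468 ham against a total of 339) are
reported, which matches the two assertions seen in the tracebacks.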
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2005-01-12 23:13
Message:
Logged In: YES
user_id=552329
The problem is that nham (nspam) is meant to be the total
number of ham (spam) messages that you have trained. It
looks like it's 0 above, which is not good.
Offhand, I'm not sure what would cause this - updating the
nham/nspam values is done at the same time as the token
counts, so if one is wrong, they really ought to both be.
I'll try and find time to try and replicate this here later
today and update with what happens.
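
To illustrate what "done at the same time" means here, the
following is a minimal sketch, not the SpamBayes storage code. It
assumes a table shaped like the bayes table in this report, uses
Python's sqlite3 module as a stand-in for MySQL, and uses a
hypothetical train() helper that bumps the "saved state" total and
each distinct token of one message inside a single transaction. It
also assumes each token is counted at most once per message.

```python
import sqlite3

# Row that stores the total number of trained spam/ham messages.
STATE_KEY = "saved state"


def train(conn, tokens, is_spam):
    """Hypothetical helper: update the total and the token counts together."""
    # The column name comes from this fixed pair, never from user input.
    column = "nspam" if is_spam else "nham"
    with conn:  # commits on success, rolls back if any statement raises
        for word in [STATE_KEY] + sorted(set(tokens)):
            conn.execute(
                "INSERT OR IGNORE INTO bayes (word) VALUES (?)", (word,))
            conn.execute(
                "UPDATE bayes SET %s = %s + 1 WHERE word = ?"
                % (column, column),
                (word,))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes ("
    "word TEXT PRIMARY KEY, "
    "nspam INTEGER NOT NULL DEFAULT 0, "
    "nham INTEGER NOT NULL DEFAULT 0)")

train(conn, ["subject:", "header:Message-ID:1"], is_spam=False)
train(conn, ["subject:"], is_spam=True)

# Totals, then any token whose count exceeds them (should be none).
total_spam, total_ham = conn.execute(
    "SELECT nspam, nham FROM bayes WHERE word = ?", (STATE_KEY,)).fetchone()
over = conn.execute(
    "SELECT word FROM bayes WHERE nspam > ? OR nham > ?",
    (total_spam, total_ham)).fetchall()
print(total_spam, total_ham, over)
```

If the totals and the token counts are always written in the same
transaction like this, no token count should ever exceed the
corresponding total.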
----------------------------------------------------------------------