[spambayes-bugs] [ spambayes-Bugs-797890 ] Assertion errors from classifier for new messages
SourceForge.net
noreply at sourceforge.net
Mon Sep 1 15:11:57 EDT 2003
Bugs item #797890, was opened at 2003-08-30 19:37
Message generated for change (Settings changed) made by richiehindle
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=797890&group_id=61702
Category: pop3proxy
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Anderson J. Vitous (avitous)
>Assigned to: Richie Hindle (richiehindle)
Summary: Assertion errors from classifier for new messages
Initial Comment:
Installed from CVS latest version as of 8/30/03 (1.0a4
wouldn't work for me so reinstalled). Python version is
2.2.3 on Windows XP with latest pybsddb installed. Trained
with recent spam/ham collections (about equal numbers
of messages, approx. 400 each), configured proxy. Using
Mahogany mail client pointed to pop3proxy (localhost).
Each new message retrieved results in an assertion
failure:
AssertionError
Traceback (most recent call last):
  File "D:\apps\spambayes\pop3proxy.py", line 439, in onRetr
    evidence=True)
  File "D:\apps\spambayes\spambayes\classifier.py", line 223, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "D:\apps\spambayes\spambayes\classifier.py", line 451, in _getclues
    prob = self.probability(record)
  File "D:\apps\spambayes\spambayes\classifier.py", line 307, in probability
    assert hamcount <= nham
(from console running pop3proxy.py)
Also inserted into headers of each message:
X-Spambayes-Exception: exceptions.AssertionError() in probability() at D:\apps\spambayes\spambayes\classifier.py line 307: assert hamcount <= nham
----------------------------------------------------------------------
>Comment By: Richie Hindle (richiehindle)
Date: 2003-09-01 21:11
Message:
Logged In: YES
user_id=85414
Fixed. After training on a message, we now ensure that
the classifier's state (nspam, nham) is written to the
database. Otherwise, if training goes wrong, you can
get a wordinfo whose count is greater than nspam/nham
- for instance, when training on a 100-message mailbox
file, if the 99th message caused an exception you could
get a clue with a ham count of 99 but an nham of 0 (or
whatever it was when you started training).
This is done for bsddb storage, and isn't needed for pickle
storage - SQL-based storage isn't done, because that
should probably be transaction-based and I'll leave that
for Skip/Tony/A. N. Other SQL-Storage Person. 8-)
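
To illustrate the failure mode and the fix described above, here is a minimal sketch. It is not the SpamBayes code: ToyClassifier, its "__state__" key, and train_mailbox are hypothetical names, and a plain dict stands in for the bsddb database. It assumes per-word counts reach the database as soon as each message is learned, while the (nspam, nham) totals stay in memory until they are explicitly stored.

```python
# Illustrative sketch only: ToyClassifier and the dict "db" are stand-ins,
# not the real SpamBayes classes or the bsddb storage layer.

class ToyClassifier:
    STATE_KEY = "__state__"  # hypothetical key holding (nspam, nham)

    def __init__(self, db):
        self.db = db
        # Totals are read from the database when the classifier is opened.
        self.nspam, self.nham = db.get(self.STATE_KEY, (0, 0))

    def store_state(self):
        self.db[self.STATE_KEY] = (self.nspam, self.nham)

    def learn_ham(self, words, store_state=True):
        if words is None:
            raise ValueError("unparseable message")
        self.nham += 1
        for word in set(words):
            spamcount, hamcount = self.db.get(word, (0, 0))
            self.db[word] = (spamcount, hamcount + 1)  # written immediately
        if store_state:
            self.store_state()  # the fix: persist totals after each message

    def is_consistent(self):
        # The invariant behind "assert hamcount <= nham".
        return all(spam <= self.nspam and ham <= self.nham
                   for key, (spam, ham) in self.db.items()
                   if key != self.STATE_KEY)


def train_mailbox(db, messages, store_state):
    clf = ToyClassifier(db)
    try:
        for words in messages:
            clf.learn_ham(words, store_state)
        clf.store_state()  # only reached if every message trained cleanly
    except ValueError:
        pass  # training aborted part-way through the mailbox


mailbox = [["hello"], ["hello", "world"], None]  # third message fails

old_db, new_db = {}, {}
train_mailbox(old_db, mailbox, store_state=False)  # totals saved only at end
train_mailbox(new_db, mailbox, store_state=True)   # totals saved per message

# Reopen each database, as after restarting the proxy.
old_ok = ToyClassifier(old_db).is_consistent()
new_ok = ToyClassifier(new_db).is_consistent()
```

In this sketch the database trained without per-message state writes reopens with nham of 0 but a ham count of 2 for "hello", which is the condition the assertion rejects. The database with per-message writes stays consistent even though training was interrupted.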
----------------------------------------------------------------------
Comment By: Anderson J. Vitous (avitous)
Date: 2003-09-01 16:21
Message:
Logged In: YES
user_id=353986
My original install attempt was with 1.0a4 with default db
(dumbdbm) and I had corruption problems. I then downloaded
CVS snapshot, noted I needed pybsddb (running Python 2.2.3)
so I installed it before installing SpamBayes-cvs and
discovering the issue I reported here. When I reverted to
1.0a4 I used a fresh install, and it worked since pybsddb was
now present.
----------------------------------------------------------------------
Comment By: Richie Hindle (richiehindle)
Date: 2003-08-30 23:23
Message:
Logged In: YES
user_id=85414
Could you please clarify something? You say "adding pybsddb
helped", but your initial problem description says "...with
latest pybsddb installed." When you were seeing the problems,
did you have pybsddb installed or not?
----------------------------------------------------------------------
Comment By: Anderson J. Vitous (avitous)
Date: 2003-08-30 23:19
Message:
Logged In: YES
user_id=353986
Just got 1.0a4 working (adding pybsddb helped) and although
it has a particular training issue thru smtp proxy it works
against the corpuses which caused the error documented here
in CVS snapshot version.
----------------------------------------------------------------------
Comment By: Anderson J. Vitous (avitous)
Date: 2003-08-30 21:55
Message:
Logged In: YES
user_id=353986
I'd love to help track this down, but I cannot send you my
corpuses (private data) and don't have time right now to
'sanitize' it. Perhaps later this weekend I can find the time,
but I don't want to give away anybody's emails in the process.
I'm training through the web interface, with culled most-recent
message data for spam/ham corpuses. No errors show up
during this exercise; messages are imported as a single mbox
file for each corpus. Subsequently I can query on various
words and the responses seem to be reasonable.
Error shows up when I subsequently shut down, restart
pop3proxy.py (was having problem with smtp proxy not
working with 1.0a4 so got used to doing that...), connect with
my mail client, and retrieve new messages; every message
results in the assertion error.
Please let me know what else I can do besides sending private
data: what kind of debug logs can be written while exercising
this?
----------------------------------------------------------------------
Comment By: Richie Hindle (richiehindle)
Date: 2003-08-30 20:01
Message:
Logged In: YES
user_id=85414
This is a bug that's been cropping up from time to time,
but we haven't been able to reproduce it. It sounds from
your description that you have a way to reproduce it - just
train on your corpuses and it fails straight away...? If that's
true, could you attach your corpuses to this bug report?
Or if you've private messages in there, would you be willing
to send the corpuses to me directly? I'd love to be able to
track this one down.
If you do send/attach your corpuses, please zip them up
to guarantee they don't get mangled by intermediate
mail/web servers.
How are you training? Through the web, or on the command
line? If on the command line, what is the exact command
you're using?
----------------------------------------------------------------------