[spambayes-dev] [Spambayes] ZeroDivisionError with hammie.score()

Fri Jul 14 22:40:10 CEST 2006

[I'm moving this over to spambayes-dev because it deals more with the code]

On 7/13/06, Todd Kennedy <todd.kennedy at gmail.com> wrote:
> I'm trying to integrate the spambayes package into my blogging
> software as a comment spam filter.  I've read through a bunch of the
> source, looked at the scripts provided and stuff and have a
> rudimentary understanding of how the software works.  (i think).  but
> i'm getting a ZeroDivisionError when I try to run the score method of
> hammie.
>
> [...]
>
> The exception occurs at:
>  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py",
> line 320, in probability
>    prob = spamratio / (hamratio + spamratio)
> ZeroDivisionError: float division
>
> I put in some simple print statements to print out nham, nspam,
> spamcount and hamcount.  this is their output:
> 22:14:52 (~)
> todd at mothra> ./test_sp.py
> spamcount 6
> hamcount 6
> nham 6
> nspam 6
> spamcount 6
> hamcount 6
> spamcount 6
> hamcount 6
> spamcount 6
> hamcount 6
> spamcount 0
> hamcount 0
> nham 6
> nspam 6
>
> why would spamcount and hamcount go to 0?

From the WordInfo class comments in classifier.py:

    # ... spamcount is the
    # number of trained spam msgs in which the word appears, and hamcount
    # the number of trained ham msgs.

So spamcount would be 0 if the current word has never been seen in a
trained spam message, and similarly for hamcount. A word only appears
in the training database if it has appeared in at least one trained
message, so you should never have a word with both counts at 0. The
_worddistanceget() function in the Classifier class deals with words
that do not appear in the training data by assigning them a default
probability, so the probability calculation should only run on
trained words.
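
To make the failure concrete, here is a simplified sketch of the ratio
calculation from your traceback. The function name ratio_probability
is my own invention for illustration, and I am assuming the two ratios
are simply count divided by number of trained messages; the real
probability() method does more than this.

```python
def ratio_probability(spamcount, hamcount, nspam, nham):
    """Simplified stand-in for the failing line (hypothetical helper)."""
    # Assumption: each ratio is the word's count over the number of
    # trained messages of that kind.
    hamratio = float(hamcount) / nham
    spamratio = float(spamcount) / nspam
    # This is the line from the traceback. If both counts are 0, both
    # ratios are 0.0 and the denominator is 0.0.
    return spamratio / (hamratio + spamratio)


# The first pairs in your output (6 and 6) are fine.
print(ratio_probability(6, 6, 6, 6))

# The last pair (0 and 0) reproduces the exception.
try:
    ratio_probability(0, 0, 6, 6)
except ZeroDivisionError as exc:
    print("ZeroDivisionError: %s" % exc)
```

Note that nham and nspam are both 6 in your output, so the only way to
get a zero denominator here is for both word counts to be 0.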

It's hard to say how the code might have ended up in the probability()
function with a word that wasn't in the training data. It might help
to print which word produced each of the spamcount/hamcount pairs and
compare those against the training data to see if there are any that
don't appear in the training.
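
Something along these lines might help with that check. It is only a
sketch: find_zero_count_words and the plain dict of
word -> (spamcount, hamcount) are hypothetical, so you would need to
fill the dict from your own print statements or from a query against
your training table.

```python
def find_zero_count_words(records):
    """Return, sorted, the words whose counts are both zero.

    `records` is a hypothetical dict: word -> (spamcount, hamcount).
    """
    return sorted(
        word
        for word, (spamcount, hamcount) in records.items()
        if spamcount == 0 and hamcount == 0
    )


# Made-up sample data, only to show the shape of the input.
sample = {
    "viagra": (6, 0),
    "hello": (2, 6),
    "stale": (0, 0),
}

for word in find_zero_count_words(sample):
    print("both counts are zero for: %r" % word)
```

Any word this reports is a record that should not exist in the
training data at all.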

It would also be interesting to know if you have ever tried to remove
a message from the training data (i.e. untrain the message). When a
message is removed, each word is checked to see if both counts have
gone to 0 (see the _remove_msg function), and the word should be
removed from the training data in that case. I see that you are using
the Postgres storage engine. I'm guessing a little here, but I don't
think Postgres has received as much testing as some of the other
storage formats, so it is possible that the record didn't actually
get deleted from the training database once both counts went to 0.
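
For reference, the cleanup I am describing looks roughly like the toy
version below. This is not the actual _remove_msg code: untrain_word
and the dict standing in for the training database are made up for
illustration.

```python
def untrain_word(db, word, is_spam):
    """Decrement one count for `word`; drop the record if both hit 0.

    `db` is a hypothetical dict: word -> (spamcount, hamcount).
    """
    spamcount, hamcount = db[word]
    if is_spam:
        if spamcount > 0:
            spamcount -= 1
    else:
        if hamcount > 0:
            hamcount -= 1

    if spamcount == 0 and hamcount == 0:
        # This is the step I suspect may not be reaching the database.
        del db[word]
    else:
        db[word] = (spamcount, hamcount)


db = {"casino": (1, 0), "meeting": (1, 2)}
untrain_word(db, "casino", is_spam=True)   # record should disappear
untrain_word(db, "meeting", is_spam=True)  # record stays with (0, 2)
print(db)
```

If the delete is skipped or fails, you are left with a record whose
counts are both 0, which is exactly what your output shows.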

-- 
Kenny Pitt