[spambayes-dev] [Spambayes] ZeroDivisionError with hammie.score()
Todd Kennedy
todd.kennedy at gmail.com
Sat Jul 15 23:56:19 CEST 2006
Tim,
Thanks for the reply. I understand what you're talking about with
papering over the problem.
I've included the full traceback that you get when you run the script
I provided. Hopefully this will provide some information. Any ideas
on how to resolve this would be great -- I'm moderately new to Python.
Also, I upgraded to 1.1a2 and it's still occurring...
17:53:27 (~/src/spambayes)
todd at mothra> ./test.py
Traceback (most recent call last):
  File "./test.py", line 9, in ?
    h.filter('do you want some viagra')
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 155, in filter
    debug, train)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 109, in score_and_filter
    prob, clues = self._scoremsg(msg, True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 196, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 499, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 514, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 320, in probability
    prob = spamratio / (hamratio + spamratio)
ZeroDivisionError: float division
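
To make the failing line concrete, here is a minimal sketch of the ratio step. It is a simplified stand-in, not the actual body of probability(): the function name ratio_prob and the way nspam/nham are defaulted are my assumptions. Only the final expression is copied from the traceback.

```python
def ratio_prob(spamcount, hamcount, nspam, nham):
    """Simplified, hypothetical stand-in for the ratio step in the traceback."""
    # Assumption: an untrained side is treated as one message, so the two
    # per-side divisions below can never fail on their own.
    nspam = float(nspam or 1)
    nham = float(nham or 1)
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    # Same expression as the failing line. The denominator is 0.0 only
    # when spamcount and hamcount are both zero.
    return spamratio / (hamratio + spamratio)


def describe(spamcount, hamcount, nspam=10, nham=10):
    """Return the ratio, or a short message if the division fails."""
    try:
        return ratio_prob(spamcount, hamcount, nspam, nham)
    except ZeroDivisionError:
        return "ZeroDivisionError: both counts are zero"
```

With these assumptions, describe(3, 1) gives a ratio close to 0.75, while describe(0, 0) reports the division error. So the error only appears if a record with both counts at zero reaches this step.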
On 7/14/06, Tim Peters <tim.peters at gmail.com> wrote:
> [Todd Kennedy]
> > With the definitions of spamcount and hamcount it makes sense that
> > they might be zero, since there is minimal training data in the
> > system, and the word being scored does not exist in the database.
> >
> > This might be some sort of small bug with running the filter on a
> > small amount of data, as I can reliably replicate a divide by zero
> > error. If spamcount and hamcount are both zero, shouldn't the system
> > return some sort of 0% probability for spam or ham (showing its
> > uncertainty for the phrase being scored)?
>
> Yes, and it does. That's what Kenny tried to tell you :-) This is
> Classifier._worddistanceget():
>
>     def _worddistanceget(self, word):
>         record = self._wordinfoget(word)
>         if record is None:
>             prob = options["Classifier", "unknown_word_prob"]
>         else:
>             prob = self.probability(record)
>         distance = abs(prob - 0.5)
>         return distance, prob, word, record
>
> If there is no record for the word, then this returns the value of the
> "unknown_word_prob" option. It only tries to _compute_ the
> probability if there _is_ a record for the word, and it should never
> be the case that a record exists for a word with hamcount and
> spamcount both 0.
>
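
The invariant described above (no record should exist with both counts at 0) could be kept roughly like this. This is only a sketch under my own assumptions: WordRecord, learn() and unlearn() are hypothetical names, and the store is a plain dict rather than whatever storage spambayes really uses.

```python
class WordRecord:
    """Hypothetical stand-in for a per-word record with two counters."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


def learn(store, word, is_spam):
    """Count one more spam or ham message containing `word`."""
    record = store.get(word)
    if record is None:
        record = store[word] = WordRecord()
    if is_spam:
        record.spamcount += 1
    else:
        record.hamcount += 1


def unlearn(store, word, is_spam):
    """Undo one learn() call for `word`, keeping the invariant intact."""
    record = store.get(word)
    if record is None:
        return
    if is_spam:
        record.spamcount = max(record.spamcount - 1, 0)
    else:
        record.hamcount = max(record.hamcount - 1, 0)
    # Drop the record once both counts reach zero, so a later lookup
    # returns None and the unknown-word probability is used instead.
    if record.spamcount == 0 and record.hamcount == 0:
        del store[word]
```

If something writes a record without going through a path like this, a record with two zero counts can exist, and that would be the kind of upstream error to look for.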
> It would be helpful to dump print statements into that function (or
> run under Python's debugger) to see exactly which word it is and
> what's in that record -- or possibly you'd discover that
> _worddistanceget() isn't being called at all. You didn't include a
> complete traceback in your original message, so it's impossible from
> here to guess who called probability() to begin with. A complete
> traceback would help.
>
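
One way to act on the suggestion above is to scan the word store for suspicious records before scoring anything. The sketch below assumes the store can be walked like a mapping from word to a record with spamcount and hamcount attributes. Record and find_empty_records() are hypothetical names for illustration, not part of spambayes.

```python
from collections import namedtuple

# Hypothetical stand-in for whatever record type the word store holds.
Record = namedtuple("Record", ["spamcount", "hamcount"])


def find_empty_records(wordinfo):
    """Return sorted (word, spamcount, hamcount) tuples for every record
    whose two counts are both zero (or missing)."""
    suspects = []
    for word, record in wordinfo.items():
        spamcount = getattr(record, "spamcount", None)
        hamcount = getattr(record, "hamcount", None)
        if not spamcount and not hamcount:
            suspects.append((word, spamcount, hamcount))
    return sorted(suspects)


if __name__ == "__main__":
    sample = {"viagra": Record(0, 0), "want": Record(2, 1)}
    for word, spamcount, hamcount in find_empty_records(sample):
        print("suspicious record:", word, spamcount, hamcount)
```

Any word this reports is a candidate for the record that ends up in probability() with a zero denominator.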
> > ...
> > If I change line 320 of classifier.py (I'm using the latest 1.1a1 release
> > now) to a very simple try/except clause:
> >     try:
> >         prob = spamratio / (hamratio + spamratio)
> >     except:
> >         prob = 0
> >
> > You can't replicate the error with the above script.
> >
> > Is this a patch that should be submitted?
>
> No, because that slows down a speed-critical function to paper over a
> problem that should never occur. The bug isn't that this is dividing
> by 0, the bug is that probability() is being _called_ when both counts
> are 0. Something, somewhere, on the path _toward_ calling
> probability() is in error.
>