[spambayes-dev] problems locating messages with bigrams
Tim Peters
tim.one at comcast.net
Tue Jan 6 19:58:34 EST 2004
[Skip]
> ...
> Another apparently strongly hammy token (prob 0.092) had me confused
> for a bit.
It still has me confused.
> When I ran extractmessages.py to identify the messages containing
> 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up.
> That should have resulted in a spamprob close to 0.5, not 0.1.
That *may* be true, but you haven't revealed enough to say whether it should
be true. If, for example, you've trained on much more spam than ham, a
feature that appears twice in each kind will have a spamprob less than 0.5,
and closer to 0 the greater the training imbalance.
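To make the imbalance effect concrete, here's a minimal sketch of a Robinson-style spamprob computation (not the actual spambayes code; the counts and the 10x skew are hypothetical, and `s`/`x` play the role of the unknown-word strength and prior):

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    # Per-corpus ratios: the same raw count in a larger corpus
    # yields a smaller ratio, so a big spam corpus pushes a
    # feature seen equally often in both corpora toward ham.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)
    # Smooth toward the prior x for rarely-seen features.
    n = spamcount + hamcount
    return (s * x + n * prob) / (s + n)

# Two spams and two hams containing the feature:
balanced = spamprob(2, 2, 1000, 1000)    # balanced training -> 0.5
skewed   = spamprob(2, 2, 10000, 1000)   # 10x more spam -> well below 0.5
```

With balanced training the 2/2 split lands exactly on 0.5; with ten times as much spam trained, the identical counts produce a strongly hammy score, which is the effect described above.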
> I eventually figured out that the way I generate bigrams:
>
> for t in Classifier()._enhance_wordstream(tokenize(msg)):
> ...
>
> uses the current training database to decide which tokens should be
> generated, the leading & trailing unigrams or the bigram of the two.
> All possible bigrams are not generated.
We covered most of this before, but I'll add that the same code is used to
generate features for training (which is a different process from scoring):
a feature is in your training database if and only if _enhance_wordstream()
generated it while training on some message
(there's no concept of "tiling" during training, only during scoring).
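As a toy illustration of the idea (this is not the spambayes implementation of _enhance_wordstream, just a sketch of one way a bigram-enhancing wordstream could pair each token with its predecessor):

```python
def enhance_wordstream(tokens):
    # Yield every unigram, plus a 'bi:' token joining each
    # adjacent pair -- the same surface form as the
    # 'bi:skip:w 20 skip:w 10' feature discussed above.
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None:
            yield "bi:%s %s" % (prev, tok)
        prev = tok

feats = list(enhance_wordstream(["skip:w 20", "skip:w 10"]))
# feats contains both unigrams and the bigram 'bi:skip:w 20 skip:w 10'
```

In the real classifier the unigrams and bigrams then compete: the scorer tiles the message with the strongest non-overlapping features, but, as noted above, that tiling happens only at scoring time, never during training.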