[spambayes-dev] problems locating messages with bigrams
Tim Peters
tim.one at comcast.net
Tue Jan 6 19:58:34 EST 2004
[Skip]
> ...
> Another apparently strongly hammy token (prob 0.092) had me confused
> for a bit.
It still has me confused.
> When I ran extractmessages.py to identify the messages containing
> 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up.
> That should have resulted in a spamprob close to 0.5, not 0.1.
That *may* be true, but you haven't revealed enough to say whether it should
be true. If, for example, you've trained on much more spam than ham, a
feature that appears twice in each kind will have a spamprob less than 0.5,
and closer to 0 the greater the training imbalance.
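To make the imbalance effect concrete, here's a minimal sketch of a Robinson-style spamprob computation (not the actual spambayes code; the counts and the 10x skew are hypothetical, and `s`/`x` play the role of the unknown-word strength and prior):

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    # Per-corpus ratios: the same raw count in a larger corpus
    # yields a smaller ratio, so a big spam corpus pushes a
    # feature seen equally often in both corpora toward ham.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)
    # Smooth toward the prior x for rarely-seen features.
    n = spamcount + hamcount
    return (s * x + n * prob) / (s + n)

# Two spams and two hams containing the feature:
balanced = spamprob(2, 2, 1000, 1000)    # balanced training -> 0.5
skewed   = spamprob(2, 2, 10000, 1000)   # 10x more spam -> well below 0.5
```

With balanced training the 2/2 split lands exactly on 0.5; with ten times as much spam trained, the identical counts produce a strongly hammy score, which is the effect described above.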
> I eventually figured out that the way I generate bigrams:
>
> for t in Classifier()._enhance_wordstream(tokenize(msg)):
> ...
>
> uses the current training database to decide which tokens should be
> generated, the leading & trailing unigrams or the bigram of the two.
> All possible bigrams are not generated.
We covered most of this before, but I'll add that the same code is used to
generate features for training (which is a different process from scoring):
a feature is in your training database if and only if _enhance_wordstream()
generated it while training on some message
(there's no concept of "tiling" during training, only during scoring).
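As a toy illustration of the idea (this is not the spambayes implementation of _enhance_wordstream, just a sketch of one way a bigram-enhancing wordstream could pair each token with its predecessor):

```python
def enhance_wordstream(tokens):
    # Yield every unigram, plus a 'bi:' token joining each
    # adjacent pair -- the same surface form as the
    # 'bi:skip:w 20 skip:w 10' feature discussed above.
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None:
            yield "bi:%s %s" % (prev, tok)
        prev = tok

feats = list(enhance_wordstream(["skip:w 20", "skip:w 10"]))
# feats contains both unigrams and the bigram 'bi:skip:w 20 skip:w 10'
```

In the real classifier the unigrams and bigrams then compete: the scorer tiles the message with the strongest non-overlapping features, but, as noted above, that tiling happens only at scoring time, never during training.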