[spambayes-dev] problems locating messages with bigrams

Tim Peters tim.one at comcast.net
Tue Jan 6 13:42:52 EST 2004


[Skip Montanaro]
> ...
> I eventually figured out that the way I generate bigrams:
>
>     for t in Classifier()._enhance_wordstream(tokenize(msg)):
>         ...
>
> uses the current training database to decide which tokens should be
> generated,

I think you're hallucinating here -- _enhance_wordstream() doesn't make any
use of training data.  Whenever tokenize() yields a stream of N tokens,
_enhance_wordstream() yields a derived stream of 2*N-1 tokens.
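
For concreteness, here's a minimal sketch of a generator with that
behavior (an illustration of the idea, not the exact spambayes source;
the "bi:" token format matches the worked example below):

    def enhance_wordstream(wordstream):
        # Yield each unigram, plus a "bi:" bigram for every pair of
        # adjacent tokens:  2*N-1 output tokens for N >= 1 input tokens.
        last = None
        for token in wordstream:
            yield token
            if last is not None:
                yield "bi:%s %s" % (last, token)
            last = token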

> the leading & trailing unigrams or the bigram of the two.  All
> possible bigrams are not generated.

A specific example would clarify what you think you mean by these phrases.
By the definition of bigrams *intended* by the code, only adjacent token
pairs can be pasted together into bigrams.  If the 4 incoming tokens are a b
c d, the 2*4-1 = 7 output tokens are

    a
    b
    bi:a b
    c
    bi:b c
    d
    bi:c d

and it doesn't matter whether any of a, b, c, or d has been trained on
previously.
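
Running the sketch above on those 4 tokens reproduces exactly that
stream:

    >>> list(enhance_wordstream("a b c d".split()))
    ['a', 'b', 'bi:a b', 'c', 'bi:b c', 'd', 'bi:c d']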
