[spambayes-dev] Question about tokenize_word and Tokenizer.tokenize_body

Daniel Lyons fusion at clanspum.net
Wed Nov 10 05:38:33 CET 2004


Hi,

At the very end of Tokenizer.tokenize_body in spambayes/tokenizer.py 
(version 1.33) are lines 1593 to 1601:

            for w in text.split():
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= maxword:
                    yield w

                elif n >= 3:
                    for t in tokenize_word(w):
                        yield t

The lines inside that for loop mirror the opening of the function 
tokenize_word, lines 690-695:

def tokenize_word(word, _len=len, maxword=options["Tokenizer",
                                                  "skip_max_word_size"]):
    n = _len(word)
    # Make sure this range matches in tokenize().
    if 3 <= n <= maxword:
        yield word

This leads me to believe that tokens found in the body text are being 
generated twice by the tokenizer.  Of course this isn't causing problems 
in the classifier, because unlike Graham's mechanism it works from the 
unique token list, collecting the tokens in a set object.
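To make sure I understand the consequence, here is a standalone sketch 
(not the actual classifier code) of why duplicate tokens would be 
harmless once the stream is folded into a set:

    # Standalone illustration, not the real spambayes code: even if a
    # token stream yields each word twice, folding it into a set leaves
    # exactly one occurrence per word.
    def double_tokenize(text):
        for w in text.split():
            yield w   # once from the tokenize_body-style loop...
            yield w   # ...and once more via tokenize_word()

    tokens = list(double_tokenize("free money now"))
    print(tokens)       # ['free', 'free', 'money', 'money', 'now', 'now']
    print(set(tokens))  # each word appears only once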

But these functions both contain a comment referring to each other 
about the range being the same.  I'm unclear on the benefit of 
duplicating the code, since ultimately "all roads lead to Rome," that 
is, every word ends up going through tokenize_word anyway.  What's the 
real purpose of this duplicated effort?
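For instance, I would have expected the body loop to simply hand every 
word to tokenize_word and let it do the length check, along these lines 
(an untested sketch, not a patch against the actual tree):

    # Untested sketch, not a patch: rely on tokenize_word() for the
    # length check, since the excerpt above shows it already yields a
    # word unchanged when 3 <= len(word) <= maxword (and, as far as I
    # can tell, ignores anything shorter than 3 characters).
    for w in text.split():
        for t in tokenize_word(w):
            yield t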

Thanks in advance,

-- 
Daniel
http://www.storytotell.org -- Tell It!


