[spambayes-dev] Question about tokenize_word and Tokenizer.tokenize_body
Daniel Lyons
fusion at clanspum.net
Wed Nov 10 05:38:33 CET 2004
Hi,
At the very end of spambayes/tokenizer.py (version 1.33),
Tokenizer.tokenize_body contains lines 1593 to 1601:
    for w in text.split():
        n = len(w)
        # Make sure this range matches in tokenize_word().
        if 3 <= n <= maxword:
            yield w
        elif n >= 3:
            for t in tokenize_word(w):
                yield t
The lines inside the for loop there mirror those of the function
tokenize_word, lines 690-695:
    def tokenize_word(word, _len=len, maxword=options["Tokenizer",
                                                      "skip_max_word_size"]):
        n = _len(word)
        # Make sure this range matches in tokenize().
        if 3 <= n <= maxword:
            yield word
This leads me to believe that tokens found in the body text are being
generated twice by the tokenizer. Of course, this doesn't cause problems
in the classifier, because it reduces the token stream to the unique
tokens using a set object, unlike Graham's mechanism.
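To illustrate what I mean (just a toy example, not the actual
classifier code):

    # Even if the tokenizer yields the same token twice, reducing
    # the stream to a set leaves only one occurrence of each.
    tokens = ["free", "money", "free"]
    assert set(tokens) == {"free", "money"}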
But both functions contain a comment pointing at the other, warning
that the ranges must match. I'm unclear on the benefit of duplicating
the code, since ultimately "all roads lead to Rome," that is, through
tokenize_word. What's the real purpose of this duplicated effort?
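To be concrete, I would have expected tokenize_body to simply defer to
tokenize_word, something like this (an untested sketch, not a patch):

    for w in text.split():
        # tokenize_word already yields w when 3 <= len(w) <= maxword,
        # so the explicit range check in tokenize_body looks redundant.
        for t in tokenize_word(w):
            yield t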
Thanks in advance,
--
Daniel
http://www.storytotell.org -- Tell It!