[Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.13,1.14
montanaro@users.sourceforge.net
Tue, 27 Aug 2002 17:43:46 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29136
Modified Files:
GBayes.py
Log Message:
Add a simple trigram tokenizer. It seems to yield the best results I've
seen so far, but it has not been extensively tested.
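As a rough illustration of the behavior the new function's docstring describes (regex tokens joined together, then cut into 3-character runs), here is a minimal, self-contained sketch. It is written for current Python rather than the 2002 interpreter. The pattern for `_token_re` is taken from the docstring in the diff. The body of `tokenize_ngram` does not appear in this change, so the overlapping sliding-window version below is an assumption, not the actual GBayes.py code.

```python
import re

# Pattern copied from the docstring in the diff. The real _token_re in
# GBayes.py is not shown in this change, so this definition is an assumption.
_token_re = re.compile(r"[\w$-]+")


def tokenize_ngram(string, n):
    """Hypothetical stand-in: return every overlapping n-character window."""
    return [string[i:i + n] for i in range(len(string) - n + 1)]


def tokenize_trigram(string):
    """Join the regex tokens, then cut the result into 3-character runs."""
    squished = "".join(_token_re.findall(string))
    return tokenize_ngram(squished, 3)


print(tokenize_trigram("Buy now!"))
# ['Buy', 'uyn', 'yno', 'now']
```

Because whitespace and punctuation are dropped before the windows are cut, trigrams such as `uyn` span word boundaries. Input that squishes to fewer than three characters yields no tokens in this sketch.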
Index: GBayes.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/GBayes.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** GBayes.py 26 Aug 2002 18:55:26 -0000 1.13
--- GBayes.py 28 Aug 2002 00:43:44 -0000 1.14
***************
*** 108,111 ****
--- 108,116 ----
      return tokenize_ngram(string, 15)
+ def tokenize_trigram(string):
+     r"""tokenize w/ re '[\w$-]+', result squished to 3-char runs"""
+     lst = "".join(_token_re.findall(string))
+     return tokenize_ngram(lst, 3)
+
# add user-visible string as key and function as value - function's docstring
# serves as help string when -H is used, so keep it brief!
***************
*** 119,122 ****
--- 124,128 ----
"split": tokenize_split,
"split_fold": tokenize_split_foldcase,
+ "trigram": tokenize_trigram,
"words": tokenize_words,
"words_fold": tokenize_words_foldcase,