[Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.13,1.14

montanaro@users.sourceforge.net
Tue, 27 Aug 2002 17:43:46 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29136

Modified Files:
	GBayes.py 
Log Message:
Add a simple trigram tokenizer. This seems to yield the best results I've
seen so far, but it has not been extensively tested.
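
As a rough illustration of what "squished to 3-char runs" in the new
function's docstring could mean, here is a minimal self-contained sketch.
It is not the code from GBayes.py: the pattern for _token_re is taken from
the docstring, and tokenize_ngram below is a hypothetical stand-in that
assumes overlapping (sliding) windows, since its real definition is not
part of this diff. It is written for a current Python 3 interpreter.

```python
import re

# Assumption: pattern copied from the docstring in the diff ('[\w$-]+').
_token_re = re.compile(r"[\w$-]+")


def tokenize_ngram(string, n):
    """Hypothetical stand-in: every overlapping n-character window."""
    return [string[i:i + n] for i in range(len(string) - n + 1)]


def tokenize_trigram(string):
    """Join the regex matches into one string, then cut 3-char runs."""
    # Whitespace and punctuation outside the pattern are dropped here.
    squished = "".join(_token_re.findall(string))
    return tokenize_ngram(squished, 3)


print(tokenize_trigram("Buy now!"))
```

Under these assumptions, "Buy now!" is first squished to "Buynow", so the
trigrams can span what used to be a word boundary. Inputs shorter than
three characters after squishing produce no tokens at all.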



Index: GBayes.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/GBayes.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** GBayes.py	26 Aug 2002 18:55:26 -0000	1.13
--- GBayes.py	28 Aug 2002 00:43:44 -0000	1.14
***************
*** 108,111 ****
--- 108,116 ----
      return tokenize_ngram(string, 15)
  
+ def tokenize_trigram(string):
+     r"""tokenize w/ re '[\w$-]+', result squished to 3-char runs"""
+     lst = "".join(_token_re.findall(string))
+     return tokenize_ngram(lst, 3)
+ 
  # add user-visible string as key and function as value - function's docstring
  # serves as help string when -H is used, so keep it brief!
***************
*** 119,122 ****
--- 124,128 ----
      "split": tokenize_split,
      "split_fold": tokenize_split_foldcase,
+     "trigram": tokenize_trigram,
      "words": tokenize_words,
      "words_fold": tokenize_words_foldcase,