[Spambayes-checkins] spambayes Options.py,1.53,1.54 tokenizer.py,1.47,1.48

Tim Peters tim_one@users.sourceforge.net
Fri, 25 Oct 2002 09:35:00 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12343

Modified Files:
	Options.py tokenizer.py 
Log Message:
Added new tokenizer option replace_nonascii_chars, false by default in
the core project, BUT TRUE BY DEFAULT IN THE OUTLOOK 2000 CLIENT!
Yanks and Aussies who don't normally correspond in Korean should find
this nails Asian spam more effectively, with less training and a smaller
database burden.
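
To make the effect concrete, here's what the substitution does, using the
non_ascii_translate_tab built in the tokenizer.py diff below (interactive
Python 2 sketch; the sample string is made up):

    >>> 'Buy \xc7\xb9\xbb cheap!\x07'.translate(non_ascii_translate_tab)
    'Buy ??? cheap!?'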


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.53
retrieving revision 1.54
diff -C2 -d -r1.53 -r1.54
*** Options.py	18 Oct 2002 21:38:16 -0000	1.53
--- Options.py	25 Oct 2002 16:34:16 -0000	1.54
***************
*** 100,103 ****
--- 100,111 ----
  generate_long_skips: True
  
+ # If true, replace high-bit characters (ord(c) >= 128) and control characters
+ # with question marks.  This allows non-ASCII character strings to be
+ # identified with little training and small database burden.  It's appropriate
+ # only if your ham is plain 7-bit ASCII, or nearly so, so that the mere
+ # presence of non-ASCII character strings is known in advance to be a strong
+ # spam indicator.
+ replace_nonascii_chars: False
+ 
  [TestDriver]
  # These control various displays in class TestDriver.Driver, and Tester.Test.
***************
*** 279,282 ****
--- 287,291 ----
                    'basic_header_tokenize_only': boolean_cracker,
                    'basic_header_skip': ('get', lambda s: Set(s.split())),
+                   'replace_nonascii_chars': boolean_cracker,
                   },
      'TestDriver': {'nbuckets': int_cracker,
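
For anyone who wants to flip the new switch by hand: the option is
registered under the Tokenizer section of the options map above, so a user
customization file (the ini format Options.py reads; the exact file name
and location depend on your setup) would contain:

    [Tokenizer]
    replace_nonascii_chars: True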

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.47
retrieving revision 1.48
diff -C2 -d -r1.47 -r1.48
*** tokenizer.py	22 Oct 2002 01:37:53 -0000	1.47
--- tokenizer.py	25 Oct 2002 16:34:19 -0000	1.48
***************
*** 739,742 ****
--- 739,757 ----
  # total unique fn went from 168 to 169
  
+ # For support of the replace_nonascii_chars option, build a string.translate
+ # table that maps all high-bit chars and control chars to a '?' character.
+ 
+ non_ascii_translate_tab = ['?'] * 256
+ # leave printable ASCII (blank up to, but not including, DEL) alone
+ for i in range(32, 127):
+     non_ascii_translate_tab[i] = chr(i)
+ # leave "normal" whitespace alone
+ for ch in ' \t\r\n':
+     non_ascii_translate_tab[ord(ch)] = ch
+ del i, ch
+ 
+ non_ascii_translate_tab = ''.join(non_ascii_translate_tab)
+ 
+ 
  def crack_content_xyz(msg):
      yield 'content-type:' + msg.get_content_type()
***************
*** 1002,1006 ****
                              yield 'received:' + tok
  
!         # Message-Id:  This seems to be a small win and should no
          # adversely affect a mixed source corpus so it's always enabled.
          msgid = msg.get("message-id", "")
--- 1017,1021 ----
                              yield 'received:' + tok
  
!         # Message-Id:  This seems to be a small win and should not
          # adversely affect a mixed source corpus so it's always enabled.
          msgid = msg.get("message-id", "")
***************
*** 1077,1080 ****
--- 1092,1099 ----
              for t in tokens:
                  yield t
+ 
+             if options.replace_nonascii_chars:
+                 # Replace high-bit chars and control chars with '?'.
+                 text = text.translate(non_ascii_translate_tab)
  
              # Special tagging of embedded URLs.
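
For completeness, here's the new tokenizer logic restated as a
self-contained Python 2 snippet (same table-building code as the diff
above; the sample message text is made up):

    # Build a 256-entry string.translate table mapping high-bit chars
    # and control chars to '?', leaving printable ASCII and ordinary
    # whitespace alone.
    non_ascii_translate_tab = ['?'] * 256
    # printable ASCII (blank up to, but not including, DEL) passes through
    for i in range(32, 127):
        non_ascii_translate_tab[i] = chr(i)
    # "normal" whitespace passes through too
    for ch in ' \t\r\n':
        non_ascii_translate_tab[ord(ch)] = ch
    non_ascii_translate_tab = ''.join(non_ascii_translate_tab)

    text = 'Cheap \xb8\xc7\xbd\xba pills\x00now'
    print text.translate(non_ascii_translate_tab)  # -> Cheap ???? pills?now

Because the tokenizer splits text on whitespace, long runs of non-ASCII
collapse into a handful of distinctive all-'?' tokens, which is presumably
why so little training and so little database space suffice to nail, e.g.,
Korean spam.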