[Spambayes-checkins] spambayes tokenizer.py,1.16,1.17

Tim Peters <tim_one@users.sourceforge.net>
Wed, 11 Sep 2002 17:16:09 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2055

Modified Files:
	tokenizer.py 
Log Message:
Added code to strip uuencoded sections.  As reported on the mailing list,
this has no effect on my results, except that one spam is now judged as
ham by all the other training sets.  It shrinks the database size by a
few percent, so that makes it a tiny win.  If Anthony Baxter doesn't
report better results on his data, I'll be sorely tempted to throw this
out again.
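
For anyone curious what the stripping targets, here's a rough standalone
sketch (the regexes match the patch below; the sample message is made up
for illustration):

    import re

    uuencode_begin_re = re.compile(r"""
        ^begin \s+
        (\S+) \s+   # capture mode
        (\S+) \s+   # capture filename
        $
    """, re.VERBOSE | re.MULTILINE)

    uuencode_end_re = re.compile(r"^end\s*\n", re.MULTILINE)

    sample = ("some plain text\n"
              "begin 644 picture.jpg\n"
              "M9FEL;&5R(&QI;F4@;V8@=75E;F-O9&5D(&=A<F)A9V4\n"
              "`\n"
              "end\n"
              "more plain text\n")

    m = uuencode_begin_re.search(sample)
    print(m.groups())    # ('644', 'picture.jpg')
    # Everything from the begin line through the matching end line is
    # dropped from the text; only summary clues built from the mode and
    # filename survive (see crack_uuencode in the patch).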


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** tokenizer.py	11 Sep 2002 06:58:03 -0000	1.16
--- tokenizer.py	12 Sep 2002 00:16:07 -0000	1.17
***************
*** 747,750 ****
--- 747,787 ----
          yield '.'.join(parts[:i])
  
+ uuencode_begin_re = re.compile(r"""
+     ^begin \s+
+     (\S+) \s+   # capture mode
+     (\S+) \s*   # capture filename
+     $
+ """, re.VERBOSE | re.MULTILINE)
+ 
+ uuencode_end_re = re.compile(r"^end\s*\n", re.MULTILINE)
+ 
+ # Strip out uuencoded sections and produce tokens.  The return value
+ # is (new_text, sequence_of_tokens), where new_text no longer contains
+ # uuencoded stuff.  Note that we're not bothering to decode it!  Maybe
+ # we should.
+ def crack_uuencode(text):
+     new_text = []
+     tokens = []
+     i = 0
+     while True:
+         # Invariant:  Through text[:i], all non-uuencoded text is in
+         # new_text, and tokens contains summary clues for all uuencoded
+         # portions.  text[i:] hasn't been looked at yet.
+         m = uuencode_begin_re.search(text, i)
+         if not m:
+             new_text.append(text[i:])
+             break
+         start, end = m.span()
+         new_text.append(text[i : start])
+         mode, fname = m.groups()
+         tokens.append('uuencode mode:%s' % mode)
+         tokens.extend(['uuencode:%s' % x for x in crack_filename(fname)])
+         m = uuencode_end_re.search(text, end)
+         if not m:
+             break
+         i = m.end()
+ 
+     return ''.join(new_text), tokens
+ 
  class Tokenizer:
  
***************
*** 881,884 ****
--- 918,926 ----
              # Normalize case.
              text = text.lower()
+ 
+             # Get rid of uuencoded sections.
+             text, tokens = crack_uuencode(text)
+             for t in tokens:
+                 yield t
  
              # Special tagging of embedded URLs.
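
End to end, the only visible change in the token stream is the extra
uuencode clues.  A quick interactive check, assuming you're running from
a spambayes checkout so tokenizer is importable:

    >>> from tokenizer import crack_uuencode
    >>> text, tokens = crack_uuencode("foo\nbegin 644 picture.jpg\nMxyz\n`\nend\nbar\n")
    >>> text
    'foo\nbar\n'
    >>> tokens[0]
    'uuencode mode:644'

The remaining tokens are 'uuencode:...' clues built from whatever
crack_filename() splits 'picture.jpg' into.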