[Spambayes-checkins] spambayes tokenizer.py,1.3,1.4

Sun, 08 Sep 2002 01:08:04 -0700

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24720

Modified Files:
	tokenizer.py 
Log Message:
Add results from latest experiments with tokenization and HTML stripping.

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** tokenizer.py	7 Sep 2002 19:44:31 -0000	1.3
--- tokenizer.py	8 Sep 2002 08:08:02 -0000	1.4
***************
*** 205,209 ****
  #
  #    total unique fn went from 292 to 302
! 

  ##############################################################################
--- 205,299 ----
  #
  #    total unique fn went from 292 to 302
! #
! # Later:  Here's another tokenization scheme with more promise.
! #
! #     fold case, ignore punctuation, strip a trailing 's' from words (to
! #     stop Guido griping about "hotel" and "hotels" getting scored as
! #     distinct clues <wink>) and save both word bigrams and word unigrams
! #
! # This was the code:
! #
! #     # Tokenize everything in the body.
! #     lastw = ''
! #     for w in word_re.findall(text):
! #         n = len(w)
! #         # Make sure this range matches in tokenize_word().
! #         if 3 <= n <= 12:
! #             if w[-1] == 's':
! #                 w = w[:-1]
! #             yield w
! #             if lastw:
! #                 yield lastw + w
! #             lastw = w + ' '
! #
! #         elif n >= 3:
! #             lastw = ''
! #             for t in tokenize_word(w):
! #                 yield t
! #
! # where
! #
! #     word_re = re.compile(r"[\w$\-\x80-\xff]+")
! #
! # This at least doubled the process size.  It helped the f-n rate
! # significantly, but probably hurt the f-p rate (the f-p rate is too low
! # with only 4000 hams per run to be confident about changes of such small
! # *absolute* magnitude -- 0.025% is a single message in the f-p table):
! #
! # false positive percentages
! #     0.000  0.000  tied
! #     0.000  0.075  lost  +(was 0)
! #     0.050  0.125  lost  +150.00%
! #     0.025  0.000  won   -100.00%
! #     0.075  0.025  won    -66.67%
! #     0.000  0.050  lost  +(was 0)
! #     0.100  0.175  lost   +75.00%
! #     0.050  0.050  tied
! #     0.025  0.050  lost  +100.00%
! #     0.025  0.000  won   -100.00%
! #     0.050  0.125  lost  +150.00%
! #     0.050  0.025  won    -50.00%
! #     0.050  0.050  tied
! #     0.000  0.025  lost  +(was 0)
! #     0.000  0.025  lost  +(was 0)
! #     0.075  0.050  won    -33.33%
! #     0.025  0.050  lost  +100.00%
! #     0.000  0.000  tied
! #     0.025  0.100  lost  +300.00%
! #     0.050  0.150  lost  +200.00%
! #
! # won   5 times
! # tied  4 times
! # lost 11 times
! #
! # total unique fp went from 13 to 21
! #
! # false negative percentages
! #     0.327  0.218  won    -33.33%
! #     0.400  0.218  won    -45.50%
! #     0.327  0.218  won    -33.33%
! #     0.691  0.691  tied
! #     0.545  0.327  won    -40.00%
! #     0.291  0.218  won    -25.09%
! #     0.218  0.291  lost   +33.49%
! #     0.654  0.473  won    -27.68%
! #     0.364  0.327  won    -10.16%
! #     0.291  0.182  won    -37.46%
! #     0.327  0.254  won    -22.32%
! #     0.691  0.509  won    -26.34%
! #     0.582  0.473  won    -18.73%
! #     0.291  0.255  won    -12.37%
! #     0.364  0.218  won    -40.11%
! #     0.436  0.327  won    -25.00%
! #     0.436  0.473  lost    +8.49%
! #     0.218  0.218  tied
! #     0.291  0.255  won    -12.37%
! #     0.254  0.364  lost   +43.31%
! #
! # won  15 times
! # tied  2 times
! # lost  3 times
! #
! # total unique fn went from 106 to 94

  ##############################################################################
***************
*** 313,318 ****
  # do that part.  However, even after stripping tags, the rates above show that
  # at least 98% of spams are still correctly identified as spam.
! # XXX So, if another way is found to slash the f-n rate, the decision here
! # XXX not to strip HTML from HTML-only msgs should be revisited.

  ##############################################################################
--- 403,471 ----
  # do that part.  However, even after stripping tags, the rates above show that
  # at least 98% of spams are still correctly identified as spam.
! #
! # So, if another way is found to slash the f-n rate, the decision here not
! # to strip HTML from HTML-only msgs should be revisited.
! #
! # Later, after the f-n rate got slashed via other means:
! #
! # false positive percentages
! #     0.000  0.000  tied
! #     0.000  0.000  tied
! #     0.050  0.075  lost   +50.00%
! #     0.025  0.025  tied
! #     0.075  0.025  won    -66.67%
! #     0.000  0.000  tied
! #     0.100  0.100  tied
! #     0.050  0.075  lost   +50.00%
! #     0.025  0.025  tied
! #     0.025  0.000  won   -100.00%
! #     0.050  0.075  lost   +50.00%
! #     0.050  0.050  tied
! #     0.050  0.025  won    -50.00%
! #     0.000  0.000  tied
! #     0.000  0.000  tied
! #     0.075  0.075  tied
! #     0.025  0.025  tied
! #     0.000  0.000  tied
! #     0.025  0.025  tied
! #     0.050  0.050  tied
! #
! # won   3 times
! # tied 14 times
! # lost  3 times
! #
! # total unique fp went from 13 to 11
! #
! # false negative percentages
! #     0.327  0.400  lost   +22.32%
! #     0.400  0.400  tied
! #     0.327  0.473  lost   +44.65%
! #     0.691  0.654  won     -5.35%
! #     0.545  0.473  won    -13.21%
! #     0.291  0.364  lost   +25.09%
! #     0.218  0.291  lost   +33.49%
! #     0.654  0.654  tied
! #     0.364  0.473  lost   +29.95%
! #     0.291  0.327  lost   +12.37%
! #     0.327  0.291  won    -11.01%
! #     0.691  0.654  won     -5.35%
! #     0.582  0.655  lost   +12.54%
! #     0.291  0.400  lost   +37.46%
! #     0.364  0.436  lost   +19.78%
! #     0.436  0.582  lost   +33.49%
! #     0.436  0.364  won    -16.51%
! #     0.218  0.291  lost   +33.49%
! #     0.291  0.400  lost   +37.46%
! #     0.254  0.327  lost   +28.74%
! #
! # won   5 times
! # tied  2 times
! # lost 13 times
! #
! # total unique fn went from 106 to 122
! #
! # So HTML decorations are still a significant clue when the ham is composed
! # of c.l.py traffic.  Again, this should be revisited if the f-n rate is
! # slashed again.

  ##############################################################################
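
For readers who want to try the bigram scheme quoted in the first hunk, here is a minimal, self-contained sketch in modern Python. It follows the commented-out loop and the word_re pattern from the diff. Two things are assumptions rather than part of the checkin: the name tokenize_body() is made up for illustration, and tokenize_word() is a hypothetical stand-in, since the real function lives elsewhere in tokenizer.py and is not shown here. Case folding is done with lower(), which the log describes ("fold case") but the quoted loop does not show.

```python
import re

# Pattern quoted in the checkin comment.
word_re = re.compile(r"[\w$\-\x80-\xff]+")


def tokenize_word(word):
    # Hypothetical stand-in: the real tokenize_word() is defined elsewhere
    # in tokenizer.py and is not part of this diff.
    yield "long:%d" % len(word)


def tokenize_body(text):
    """Yield word unigrams and adjacent-word bigrams from text."""
    lastw = ''
    for w in word_re.findall(text.lower()):   # fold case (assumed: lower())
        n = len(w)
        if 3 <= n <= 12:
            # Strip a trailing 's' so "hotel" and "hotels" score as one clue.
            if w[-1] == 's':
                w = w[:-1]
            yield w
            if lastw:
                yield lastw + w                # bigram: previous word + ' ' + w
            lastw = w + ' '
        elif n >= 3:
            # Words longer than 12 chars break the bigram chain and are
            # handed to tokenize_word() instead.
            lastw = ''
            for t in tokenize_word(w):
                yield t
        # Words shorter than 3 chars are skipped and do not reset lastw,
        # so a bigram can span them.


if __name__ == "__main__":
    print(list(tokenize_body("Cheap HOTELS in Paris")))
```

Note that the naive 's' stripping also turns "paris" into "pari", and that every in-range word after the first emits two tokens, which is consistent with the log's remark that the scheme at least doubled the process size.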