[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.5,1.6
tim_one@users.sourceforge.net
Sat, 31 Aug 2002 21:42:54 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18035
Modified Files:
timtest.py
Log Message:
Long new comment block summarizing all my experiments with character
n-grams. Bottom line is that they have nothing going for them, and a
lot going against them, under Graham's scheme. I believe there may
still be a place for them in *part* of a word-based tokenizer, though.
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timtest.py 31 Aug 2002 21:33:10 -0000 1.5
--- timtest.py 1 Sep 2002 04:42:51 -0000 1.6
***************
*** 57,71 ****
return text - redundant_html
- url_re = re.compile(r"""
- (https? | ftp) # capture the protocol
- :// # skip the boilerplate
- # Do a reasonable attempt at detecting the end. It may or may not
- # be in HTML, may or may not be in quotes, etc. If it's full of %
- # escapes, cool -- that's a clue too.
- ([^\s<>'"\x7f-\xff]+) # capture the guts
- """, re.IGNORECASE | re.VERBOSE)
-
- urlsep_re = re.compile(r"[;?:@&=+,$.]")
-
# To fold case or not to fold case? I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
--- 57,60 ----
***************
*** 93,96 ****
--- 82,180 ----
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
+
+
+ # Character n-grams or words?
+ #
+ # With careful multiple-corpora c.l.py tests sticking to case-folded decoded
+ # text-only portions, and ignoring headers, and with identical special
+ # parsing & tagging of embedded URLs:
+ #
+ # Character 3-grams gave 5x as many false positives as split-on-whitespace
+ # (s-o-w). The f-n rate was also significantly worse, but within a factor
+ # of 2. So character 3-grams lost across the board.
+ #
+ # Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
+ # s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
+ # difference between 23 and 34 f-ps. There aren't enough there to say that's
+ # significantly more with killer-high confidence.  There were plenty of f-ns,
+ # though, and the f-n rate with character 5-grams was substantially *worse*
+ # than with character 3-grams (which in turn was substantially worse than
+ # with s-o-w).
+ #
+ # Training on character 5-grams creates many more unique tokens than s-o-w:
+ # a typical run bloated to 150MB process size. It also ran a lot slower than
+ # s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
+ # dict. I rarely noticed disk activity when running s-o-w, so rarely bothered
+ # to look at process size; it was under 30MB last time I looked.
+ #
+ # Figuring out *why* a msg scored as it did proved much more mysterious when
+ # working with character n-grams: they often had no obvious "meaning". In
+ # contrast, it was always easy to figure out what s-o-w was picking up on.
+ # 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
+ # the speed of tasklets under his new implementation of stackless:
+ #
+ # prob = 0.99999998959
+ # prob('ed sw') = 0.01
+ # prob('http0:pgp') = 0.01
+ # prob('http0:python') = 0.01
+ # prob('hlon ') = 0.99
+ # prob('http0:wwwkeys') = 0.01
+ # prob('http0:starship') = 0.01
+ # prob('http0:stackless') = 0.01
+ # prob('n xp ') = 0.99
+ # prob('on xp') = 0.99
+ # prob('p 150') = 0.99
+ # prob('lon x') = 0.99
+ # prob(' amd ') = 0.99
+ # prob(' xp 1') = 0.99
+ # prob(' athl') = 0.99
+ # prob('1500+') = 0.99
+ # prob('xp 15') = 0.99
+ #
+ # The spam decision was baffling until I realized that *all* the high-
+ # probability spam 5-grams there came out of a single phrase:
+ #
+ # AMD Athlon XP 1500+
+ #
+ # So Christian was punished for using a machine lots of spam tries to sell
+ # <wink>. In a classic Bayesian classifier, this probably wouldn't have
+ # mattered, but Graham's throws away almost all the 5-grams from a msg,
+ # saving only the about-a-dozen farthest from a neutral 0.5. So one bad
+ # phrase can kill you! This appears to happen very rarely, but happened
+ # more than once.
+ #
+ # The conclusion is that character n-grams have almost nothing to recommend
+ # them under Graham's scheme: harder to work with, slower, much larger
+ # database, worse results, and prone to rare mysterious disasters.
+ #
+ # There's one area they won hands-down: detecting spam in what I assume are
+ # Asian languages. The s-o-w scheme sometimes finds only line-ends to split
+ # on then, and then a "hey, this 'word' is way too big! let's ignore it"
+ # gimmick kicks in, and produces no tokens at all.
+ #
+ # XXX Try producing character n-grams then under the s-o-w scheme, instead
+ # XXX of ignoring the blob.  This was too unattractive before because we
+ # XXX weren't decoding base64 or qp.  We're still not decoding uuencoded
+ # XXX stuff. So try this only if there are high-bit characters in the blob.
+ #
+ # Interesting: despite that odd example above, the *kinds* of f-p mistakes
+ # 5-grams made were very much like s-o-w made -- I recognized almost all of
+ # the 5-gram f-p messages from previous s-o-w runs. For example, both
+ # schemes have a particular hatred for conference announcements, although
+ # s-o-w stopped hating them after folding case. But 5-grams still hate them.
+ # Both schemes also hate msgs discussing HTML with examples, with about equal
+ # passion. Both schemes hate brief "please subscribe [unsubscribe] me"
+ # msgs, although 5-grams seems to hate them more.
+
+ url_re = re.compile(r"""
+ (https? | ftp) # capture the protocol
+ :// # skip the boilerplate
+ # Do a reasonable attempt at detecting the end. It may or may not
+ # be in HTML, may or may not be in quotes, etc. If it's full of %
+ # escapes, cool -- that's a clue too.
+ ([^\s<>'"\x7f-\xff]+) # capture the guts
+ """, re.IGNORECASE | re.VERBOSE)
+
+ urlsep_re = re.compile(r"[;?:@&=+,$.]")
def tokenize(string):
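
The comment block above compares two tokenization schemes. A minimal sketch of both, for illustration only (these helpers are not the checked-in code, and the names are made up):

```python
# Hypothetical sketch of the two tokenizers compared in the commit's
# comment block, assuming case-folded text as the experiments used.

def tokenize_sow(text):
    """Split-on-whitespace (s-o-w): one token per whitespace-delimited word."""
    return text.lower().split()

def tokenize_ngrams(text, n=5):
    """Character n-grams: every overlapping run of n characters."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(tokenize_sow("AMD Athlon XP 1500+"))
# ['amd', 'athlon', 'xp', '1500+']
print(tokenize_ngrams("AMD Athlon XP 1500+")[:4])
# ['amd a', 'md at', 'd ath', ' athl']
```

The n-gram version makes the size blowup described above easy to see: a 19-character phrase yields 4 s-o-w tokens but 15 overlapping 5-grams, each a distinct entry in the wordinfo dict.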
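
The "one bad phrase can kill you" effect comes from Graham's scoring step keeping only the tokens farthest from a neutral 0.5. A sketch of that selection, under the assumption of a roughly-a-dozen cutoff (`MAX_DISCRIMINATORS` is an assumed name, not from timtest.py):

```python
# Hypothetical sketch of Graham-style discriminator selection: only the
# tokens whose spam probability is farthest from 0.5 contribute to the
# score, so a dozen 0.99 n-grams from one phrase can swamp a message.

MAX_DISCRIMINATORS = 15  # assumed cutoff, "about a dozen"

def extreme_tokens(probs):
    """Return the (token, prob) pairs farthest from a neutral 0.5."""
    ranked = sorted(probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked[:MAX_DISCRIMINATORS]
```

In the Tismer example above, the 0.99 5-grams from "AMD Athlon XP 1500+" all survive this cut, while the mass of near-0.5 tokens from the rest of the message is discarded before scoring.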