[spambayes-dev] Very small change for composite word tokenizing.

Mon Aug 4 16:56:29 EDT 2003

This is the code that does it, in context, if not in patch form. I had
mailed it to Tony, but not to the whole list.
Sorry about that.

-- Sean

Not exactly a patch, but it's a one-minute cut and paste. I'm theorizing
that the memory hit is not horrendous, since it mostly generates sensible
fragments:
www.microsoft.com -> www, microsoft, com
Very_naughty_bits -> very, naughty, bits

-> longword_re = re.compile(r"[a-zA-Z0-9$]+")

   def tokenize_word(word, _len=len, maxword=options.skip_max_word_size):
       n = _len(word)
       # Make sure this range matches in tokenize().
       if 3 <= n <= maxword:
           yield word

       elif n >= 3:
           # A long word.

           # Don't want to skip embedded email addresses.
           # An earlier scheme also split up the y in x at y on '.'.  Not
           # splitting improved the f-n rate; the f-p rate didn't care
           # either way.
           if n < 40 and '.' in word and word.count('@') == 1:
               p1, p2 = word.split('@')
               yield 'email name:' + p1
               yield 'email addr:' + p2
           else:
               # There's value in generating a token indicating roughly how
               # many chars were skipped.  This has real benefit for the f-n
               # rate, but is neutral for the f-p rate.  I don't know why!
               # XXX Figure out why, and/or see if some other way of
               # XXX summarizing this info has greater benefit.
               if options.generate_long_skips:
                   yield "skip:%c %d" % (word[0], n // 10 * 10)
               if has_highbit_char(word):
                   hicount = 0
                   for i in map(ord, word):
                       if i >= 128:
                           hicount += 1
                   yield "8bit%%:%d" % round(hicount * 100.0 / len(word))

->             # Break up composite words looking for good stuff
->             for w in longword_re.findall(word):
->                 if 3 <= len(w) <= maxword:
->                     yield w
->
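
For anyone who wants to try the splitting step on its own, here is a
minimal standalone sketch of the idea. It is not the tokenizer code
itself: split_composite and MAX_WORD are hypothetical names, MAX_WORD = 12
is an assumed stand-in for options.skip_max_word_size, and the character
class assumes 0-9 was intended. Lowercasing is assumed to happen elsewhere,
so the second input is already lowercase.

```python
import re

# Assumption: digits 0-9 are all meant to count as word characters.
LONGWORD_RE = re.compile(r"[a-zA-Z0-9$]+")

# Assumption: hypothetical stand-in for options.skip_max_word_size.
MAX_WORD = 12


def split_composite(word, maxword=MAX_WORD):
    """Yield the fragments of a composite word that are valid tokens.

    A fragment is a run of letters, digits, or '$'; it is kept only if
    its length falls in the same 3..maxword range used for plain words.
    """
    for w in LONGWORD_RE.findall(word):
        if 3 <= len(w) <= maxword:
            yield w


print(list(split_composite("www.microsoft.com")))
print(list(split_composite("very_naughty_bits")))
```

Fragments shorter than 3 characters are dropped, so the extra tokens
stay in the same size range as ordinary word tokens.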