[spambayes-dev] Very small change for composite word tokenizing.
Sean True
seant at webreply.com
Mon Aug 4 16:56:29 EDT 2003
This is the code that does it, in context, if not in patch form. I had
mailed it to Tony, but not to the whole list.
Sorry about that.
-- Sean
Not exactly a patch, but it's a one-minute cut and paste. I'm theorizing
that the memory hit is not horrendous -- it mostly generates sensible fragments:
    www.microsoft.com -> www, microsoft, com
    Very_naughty_bits -> very, naughty, bits
-> longword_re = re.compile(r"[a-zA-Z1-9$]+")

   def tokenize_word(word, _len=len, maxword=options.skip_max_word_size):
       n = _len(word)
       # Make sure this range matches in tokenize().
       if 3 <= n <= maxword:
           yield word

       elif n >= 3:
           # A long word.

           # Don't want to skip embedded email addresses.
           # An earlier scheme also split up the y in x@y on '.'.  Not splitting
           # improved the f-n rate; the f-p rate didn't care either way.
           if n < 40 and '.' in word and word.count('@') == 1:
               p1, p2 = word.split('@')
               yield 'email name:' + p1
               yield 'email addr:' + p2

           else:
               # There's value in generating a token indicating roughly how
               # many chars were skipped.  This has real benefit for the f-n
               # rate, but is neutral for the f-p rate.  I don't know why!
               # XXX Figure out why, and/or see if some other way of summarizing
               # XXX this info has greater benefit.
               if options.generate_long_skips:
                   yield "skip:%c %d" % (word[0], n // 10 * 10)
               if has_highbit_char(word):
                   hicount = 0
                   for i in map(ord, word):
                       if i >= 128:
                           hicount += 1
                   yield "8bit%%:%d" % round(hicount * 100.0 / len(word))
->             # Break up composite words looking for good stuff
->             for w in longword_re.findall(word):
->                 if 3 <= len(w) <= maxword:
->                     yield w
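For anyone who wants to try the composite-word step on its own, here is a minimal standalone sketch. It reuses the longword_re pattern from the snippet above; the helper name split_composite and the default maxword=12 are assumptions made for illustration (the real value comes from options.skip_max_word_size), and the input is assumed to be already lowercased.

```python
import re

# Same pattern as in the snippet: runs of letters, the digits 1-9, and '$'.
longword_re = re.compile(r"[a-zA-Z1-9$]+")


def split_composite(word, minlen=3, maxword=12):
    """Yield fragments of `word` whose length is within [minlen, maxword].

    `split_composite`, `minlen`, and the default `maxword=12` are
    hypothetical names and values chosen for this sketch.
    """
    for w in longword_re.findall(word):
        if minlen <= len(w) <= maxword:
            yield w


print(list(split_composite("www.microsoft.com")))   # ['www', 'microsoft', 'com']
print(list(split_composite("very_naughty_bits")))   # ['very', 'naughty', 'bits']
```

Fragments shorter than three characters are dropped, which is what keeps the number of extra tokens (and so the memory hit) modest.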