[Spambayes] Tokenizer output text range, high bits
Brad Clements
bkc@murkworks.com
Mon, 14 Oct 2002 16:12:48 -0400
I thought I'd read on the list that the tokenizer doesn't return chars with the "high bit" set,
and instead just creates a new token indicating that.
So, when going through the classifier wordlist keys, I don't expect to see any keys with
chars where ord(c) & 0x80 != 0.
However, I am finding some.
Also, I'm finding chars whose ord() < 32.
I'm not so worried about the latter (as long as there aren't any NULs), but I'm somewhat
concerned about the high-bit chars. Unicode? I don't want to deal with that just now.. :-(
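
For anyone who wants to reproduce the check, here is a minimal sketch of the kind of scan described above. It assumes the wordlist keys can be iterated as plain strings; the function name find_suspect_keys and the sample wordinfo dict are made up for illustration and are not part of Spambayes. For 8-bit chars, the test ord(c) >= 0x80 is equivalent to ord(c) & 0x80 != 0.

```python
def find_suspect_keys(keys):
    """Return (high_bit, control): keys containing a char with
    ord(c) >= 0x80, and keys containing a char with ord(c) < 32."""
    high_bit = []
    control = []
    for key in keys:
        codes = [ord(c) for c in key]
        # For 8-bit chars this matches ord(c) & 0x80 != 0.
        if any(code >= 0x80 for code in codes):
            high_bit.append(key)
        # Control chars, including NUL (ord 0).
        if any(code < 32 for code in codes):
            control.append(key)
    return high_bit, control


# Hypothetical stand-in for the classifier wordlist (token -> count).
wordinfo = {"free": 1, "caf\xe9": 2, "tab\there": 3, "nul\x00": 4}

high_bit, control = find_suspect_keys(wordinfo)
print("high-bit keys:", [repr(k) for k in high_bit])
print("control-char keys:", [repr(k) for k in control])
```

Printing the keys with repr() makes the offending bytes visible instead of sending them raw to the terminal.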
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements