[Spambayes] Tokenizer output text range, high bits

Brad Clements bkc@murkworks.com
Mon, 14 Oct 2002 16:12:48 -0400


I thought I'd read on the list that the tokenizer doesn't return chars with the "high bit" set,
and instead just creates a new token indicating that.

So, when going through the classifier wordlist keys, I don't expect to see any keys with
chars where ord(c) & 0x80 != 0.

However, I am finding some.

I'm also finding chars whose ord() < 32.

I'm not so worried about the latter (as long as there aren't any NULs), but I'm somewhat
concerned about the high-bit chars. Unicode? I don't want to deal with that just now. :-(
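
For reference, the check described above can be sketched roughly as follows. This is
only an illustration, not SpamBayes code: the function name find_suspect_keys and the
sample keys are hypothetical, and it assumes the wordlist keys are plain strings. It
tests ord(c) >= 0x80, which matches the ord(c) & 0x80 test for single-byte chars and
also catches code points above 255.

```python
def find_suspect_keys(keys):
    """Return (high_bit, control): keys with any char >= 0x80, keys with any char < 32."""
    high_bit = []
    control = []
    for key in keys:
        codes = [ord(c) for c in key]
        # For single-byte chars this is the same as ord(c) & 0x80 != 0.
        if any(code >= 0x80 for code in codes):
            high_bit.append(key)
        # Control characters, including NUL (0).
        if any(code < 32 for code in codes):
            control.append(key)
    return high_bit, control


# Hypothetical sample keys, not taken from a real wordlist.
sample = ["free", "caf\xe9", "tab\there", "nul\x00"]
high_bit, control = find_suspect_keys(sample)
```

A key containing both kinds of chars would show up in both lists, which seems fine for
a quick audit of the wordlist.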



Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements