[Spambayes] Watch out for this

Tim Peters tim.one at comcast.net
Wed Sep 10 18:44:54 EDT 2003


[Balazs Attila Mihaly]
> ...
> as you know in html pages characters can be written as #<number>;
> where <number> represents the ASCII (or maybe UNICODE - I'm not
> sure) code of the character. Now, if  you don't convert these characters
> back to their corresponding values ...

spambayes already decodes numeric character entities.  That's what

            # Replace numeric character entities (like &#97; for the letter
            # 'a').
            text = numeric_entity_re.sub(numeric_entity_replacer, text)

in Tokenizer.tokenize_body() does.
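
For illustration, here is a minimal sketch of what that kind of decoding can
look like.  It reuses the names numeric_entity_re and numeric_entity_replacer
from the line above, but the regex, the '?' fallback, and the
decode_numeric_entities() wrapper are assumptions made for this example, not
a copy of the spambayes source.  It handles only decimal entities.

```python
import re

# Assumed pattern: decimal numeric entities only, e.g. &#97; for 'a'.
numeric_entity_re = re.compile(r"&#(\d+);")


def numeric_entity_replacer(match):
    """Return the character for a decimal entity, or '?' if it is invalid."""
    try:
        return chr(int(match.group(1)))
    except (ValueError, OverflowError):
        # The number is not a valid code point; '?' is an assumed placeholder.
        return "?"


def decode_numeric_entities(text):
    # Hypothetical helper wrapping the substitution shown in the post.
    return numeric_entity_re.sub(numeric_entity_replacer, text)
```

With this sketch, an obfuscated word such as "&#86;i&#97;gra" becomes plain
"Viagra" before tokenizing, so it yields the same token as the unobfuscated
spelling.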

The decoding is a relatively recent addition.  Before adding it I didn't see
false negatives due to this trick, but I did get a number of irritating
Unsures, and the decoding stopped those cold.



