[Spambayes] Missing HTML payload
Tim Peters
tim.one at comcast.net
Mon Mar 3 20:50:42 EST 2003
[Mark Hammond]
> The following mail got past SpamBayes. Looking at the clues, it appears
> that spambayes was missing the HTML body of the message (which
> *does* render almost correctly in Outlook).
>
> I instrumented the "show clues" feature to show *all* message tokens
> found in the body. As you can see at the very end, the entire body was
> stripped.
>
> I am guessing that we barf on:
> <td><!--#rotato>
> a comment which is never closed.
That would do it! tokenizer.py's Stripper class eliminates (via subclasses)
various kinds of bracketed structures, and HTML comments are among them. I
see that the analyze() method will just ignore any text at and after the
last open-bracket match without a matching end-bracket construct. This was
neither intentional nor unintentional <wink>. It seems like it would be
better to replace:
    m = self.find_end(text, end)
    if not m:
        break
with:
    m = self.find_end(text, end)
    if not m:
        pushretained(text[start :]) # add this line
        break
Then the unmatched open-bracket construct, and everything following it, will
be retained. This will apply to unclosed HTML comments, unclosed style
sheets, unclosed uuencoded sections, and unclosed embedded URLs. I think
I'm fine with retaining all of those.
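
To make the effect concrete, here is a minimal, self-contained sketch of the loop with and without the extra line. It is not the real tokenizer.py code: the class body, the empty separator, the two regexes, and the keep_unclosed switch are all assumptions made up for illustration. Only the names analyze, find_start, find_end and pushretained come from the discussion above.

```python
import re


class Stripper:
    """Simplified stand-in for the Stripper idea (hypothetical, not the real class)."""

    separator = ""  # assumption: retained pieces are simply concatenated

    def __init__(self, find_start, find_end):
        # Both callables take (text, pos) and return a match object or None.
        self.find_start = find_start
        self.find_end = find_end

    def analyze(self, text, keep_unclosed=True):
        i = 0
        retained = []
        pushretained = retained.append
        while True:
            m = self.find_start(text, i)
            if not m:
                # No more open brackets: keep the rest of the text.
                pushretained(text[i:])
                break
            start, end = m.span()
            pushretained(text[i:start])
            m = self.find_end(text, end)
            if not m:
                if keep_unclosed:
                    # The proposed line: keep the unmatched opener
                    # and everything after it.
                    pushretained(text[start:])
                break
            i = m.end()
        return self.separator.join(retained)


# Hypothetical HTML-comment stripper built from two simple patterns.
comment_stripper = Stripper(re.compile(r"<!--").search,
                            re.compile(r"-->").search)

body = "<td><!--#rotato>Buy now!</td>"
old = comment_stripper.analyze(body, keep_unclosed=False)
new = comment_stripper.analyze(body)
print(repr(old))
print(repr(new))
```

With keep_unclosed=False (the current behavior) everything from the unclosed comment onward is lost; with the extra line the whole body survives and can be tokenized. Properly closed comments are stripped the same way in both cases.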
> Outlook actually shows this entire tag (ie, literally "<!--#rotato>"),
> then displays the rest of the HTML correctly - ie, I guess that we treat
> the comment as unclosed, while Outlook ignores it.
Sounds right.
> Any thoughts?
Nope, not a one <wink>.