[Spambayes] New web training interface for pop3proxy

Tim Peters tim.one@comcast.net
Sat Nov 23 22:10:27 2002


[David Ascher]
> Make 'hovertips' that display the first few lines of the body

[Richie Hindle]
> This is done.  The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!

It works very well except when it doesn't <wink>.  The chief damned-
whether-you-do-or-don't problem:  I've seen several msgs with HTML style
sheets and/or HTML comments exceeding 2K characters.  The 2K limit in the
minimal matches serves two purposes:

1. Prevent the C stack from blowing up in the regexp engine.  But
   François Granger reported a C stack blowup anyway on Mac OS 9,
   and I still have no clue how small a limit would prevent that on
   his box.

2. Prevent it from consuming an arbitrary amount of text in case
   we matched a "begin long construct" character sequence by accident.
   It's *unlikely* that random text contains "<style" or "<!--"
   by accident, though, so I'm not much worried about that one.
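
To make the tradeoff concrete, here is a minimal sketch of a bounded
minimal match.  The pattern is an illustration written for this note, not
the regexp from tokenizer.py, and the names bounded_re and LIMIT are
hypothetical.  It shows why a style sheet longer than the cap survives
untouched.

```python
import re

# Hypothetical pattern (not copied from tokenizer.py): a minimal match
# that gives up if the closing marker is more than LIMIT characters away.
LIMIT = 2048
bounded_re = re.compile(
    r"<\s*style\b.{0,%d}?<\s*/\s*style\s*>|<!--.{0,%d}?-->" % (LIMIT, LIMIT),
    re.IGNORECASE | re.DOTALL,
)

short = "<style>p {color: red}</style>Buy now"
long_ = "<style>" + "x" * 3000 + "</style>Buy now"

# The short style sheet fits under the cap and is replaced by a blank.
print(repr(bounded_re.sub(" ", short)))    # ' Buy now'

# The long one exceeds the cap, so nothing matches and it is left alone.
print(bounded_re.sub(" ", long_) == long_)  # True
```

Raising LIMIT catches more of the long cases but also lets an accidental
match swallow more text, which is the damned-either-way part.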

> (Apologies to Tim - it seems to work extremely well.)

Yes, when it works at all <wink>.  Fixing it in all cases requires doing
real HTML parsing, and that's expensive, so the current "cheap-ass gimmick"
is accurate.

> Rest assured it's safe from HTML content leaking into the web
> interface - the worst that will happen is that you'll see HTML source
> in the hovertip.

A giant <style .. </style> section near the start seems the most likely
glitch here.  Are you using this regexp *from* Python, or from Javascript?
I have half a mind to replace the comment and style nuking with an
iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
crack_urls(), which only use regexps to help find the right places to poke
at -- they can't blow the C stack).  But if you're doing this from
Javascript, that wouldn't help you.
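
A minimal sketch of what such an iterative scheme could look like, under
the assumption that regexps are used only to locate the start and end
markers and ordinary slicing does the rest.  The function name
strip_long_constructs and the three patterns are hypothetical, and this is
not how crack_uuencode() or crack_urls() are actually written.

```python
import re

# Hypothetical patterns: each one matches only a short marker, so no
# single match ever has to span an arbitrarily long stretch of text.
start_re = re.compile(r"<\s*style\b|<!--", re.IGNORECASE)
style_end_re = re.compile(r"<\s*/\s*style\s*>", re.IGNORECASE)
comment_end_re = re.compile(r"-->")

def strip_long_constructs(text):
    """Replace each <style>...</style> section and <!--...--> comment
    with a blank, however long it is, using a plain loop."""
    pieces = []
    pos = 0
    while True:
        start = start_re.search(text, pos)
        if start is None:
            break
        end_re = comment_end_re if start.group() == "<!--" else style_end_re
        end = end_re.search(text, start.end())
        if end is None:
            # Assumption: an unterminated construct is left as-is.
            break
        pieces.append(text[pos:start.start()])
        pos = end.end()
    pieces.append(text[pos:])
    return " ".join(pieces)

msg = "<style>" + "x" * 3000 + "</style>Buy <!-- hidden -->now"
print(repr(strip_long_constructs(msg)))    # ' Buy  now'
```

Because the loop carries the position itself, the length of a style
section or comment does not affect how deep the regexp engine has to go.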




