[Spambayes] New web training interface for pop3proxy

Sat Nov 23 22:51:31 2002

[Richie Hindle]
> The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!

[Tim Peters]
> Are you using this regexp *from* Python, or from Javascript?
> I have half a mind to replace the comment and style nuking with an
> iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
> crack_urls(), which only use regexps to help find the right places to poke
> at -- they can't blow the C stack).  But if you're doing this from
> Javascript, that wouldn't help you.

I'm using it from Python, but (currently) only in a relatively unimportant
feature.  I wouldn't call it worth changing for the sake of the hovertips,
but it's definitely worth changing to fix François' stack explosion.
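
For illustration, here is a rough sketch of what an iterative,
stack-friendly scheme for the comment and style nuking could look like.
This is not the tokenizer.py code: strip_comments_and_styles() and both
patterns are hypothetical names made up for this example.  The regexps
only locate where a comment or style block starts and ends; the loop does
the cutting, so no single match has to span a huge block of text.

```python
import re

# Hypothetical patterns: they only find the start of a region to remove.
_START_RE = re.compile(r"<!--|<\s*style\b", re.IGNORECASE)
_STYLE_END_RE = re.compile(r"</\s*style\s*>", re.IGNORECASE)


def strip_comments_and_styles(text):
    """Remove HTML comments and <style> blocks with a plain loop."""
    pieces = []
    pos = 0
    while True:
        m = _START_RE.search(text, pos)
        if m is None:
            # Nothing left to remove: keep the rest of the text.
            pieces.append(text[pos:])
            break
        # Keep everything before the comment or style block.
        pieces.append(text[pos:m.start()])
        if m.group() == "<!--":
            end = text.find("-->", m.end())
            # An unterminated comment swallows the rest of the text.
            end = len(text) if end < 0 else end + len("-->")
        else:
            m_end = _STYLE_END_RE.search(text, m.end())
            # Likewise for a <style> block with no closing tag.
            end = len(text) if m_end is None else m_end.end()
        pos = end
    return "".join(pieces)
```

How an unterminated comment or style block should be treated is a design
choice; this sketch simply drops everything after it.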

Somewhere (I can't find it right now, but I'll have a proper look ASAP) I
have an attempt at rewriting Tom Christiansen's striphtml program
(http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz)
using non-greedy Python regexps - it's mostly a mechanical rewrite,
changing things like "<.*?>" to "<[^>]*>", and so forth.  That might work -
I'll try to dig it out.
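
To make the two patterns concrete, here is a minimal sketch (not the
striphtml rewrite itself, which isn't shown here).  strip_tags() and the
two pattern names are hypothetical.  On simple markup both forms strip
the same tags; the negated character class never needs to backtrack
inside a tag, and it matches across newlines without re.DOTALL.

```python
import re

# The non-greedy form; re.DOTALL lets "." match newlines inside a tag.
NON_GREEDY_TAG = re.compile(r"<.*?>", re.DOTALL)

# The negated-character-class form; [^>] already matches newlines.
NEGATED_CLASS_TAG = re.compile(r"<[^>]*>")


def strip_tags(html, pattern=NEGATED_CLASS_TAG):
    """Remove anything that looks like a tag, using the given pattern."""
    return pattern.sub("", html)


sample = '<p>Hello, <a\nhref="x">world</a>!</p>'
plain_negated = strip_tags(sample)
plain_non_greedy = strip_tags(sample, NON_GREEDY_TAG)
```

Neither pattern copes with a ">" inside an attribute value or a comment,
which is the sort of case a fuller rewrite would have to handle.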

-- 
Richie Hindle
richie@entrian.com