[Spambayes] New web training interface for pop3proxy
Richie Hindle
richie@entrian.com
Sat Nov 23 22:51:31 2002
[Richie Hindle]
> The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!
[Tim Peters]
> Are you using this regexp *from* Python, or from Javascript?
> I have half a mind to replace the comment and style nuking with an
> iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
> crack_urls(), which only use regexps to help find the right places to poke
> at -- they can't blow the C stack). But if you're doing this from
> Javascript, that wouldn't help you.
I'm using it from Python, but (currently) only in a relatively unimportant
feature. I wouldn't call it worth changing for the sake of the hovertips,
but it's definitely worth changing to fix François' stack explosion.
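
For illustration, here is a minimal sketch of the kind of iterative scheme
Tim describes, where the regexp only finds the place to start and plain
string searching does the rest. The name strip_html_comments is hypothetical,
it handles only comments (not style blocks), and it is not the code in
tokenizer.py.

```python
import re

# Hypothetical helper, not the tokenizer.py code. The regexp is used only
# to locate where a comment starts; the end is found with str.find, so the
# regexp engine never has to match across the whole comment body.
_comment_start = re.compile(r"<!--")

def strip_html_comments(text):
    pieces = []
    pos = 0
    while True:
        m = _comment_start.search(text, pos)
        if m is None:
            # No more comments: keep the remainder of the text.
            pieces.append(text[pos:])
            break
        # Keep everything before the comment opener.
        pieces.append(text[pos:m.start()])
        end = text.find("-->", m.end())
        if end == -1:
            # Unterminated comment: drop the rest (an assumption made
            # for this sketch, other policies are possible).
            break
        pos = end + len("-->")
    return "".join(pieces)
```

Because the loop advances with an explicit position rather than relying on
the matcher to span the comment, the size of a comment does not affect how
deep the regexp engine has to go.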
Somewhere (I can't find it right now, but I'll have a proper look ASAP) I
have an attempt at rewriting Tom Christiansen's striphtml program
(http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz)
using Python regexps that avoid non-greedy matching - it's mostly a mechanical
rewrite, changing things like "<.*?>" to "<[^>]*>", and so forth. That might work -
I'll try to dig it out.
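
As a rough illustration of the "<.*?>" to "<[^>]*>" substitution mentioned
above (this is not the striphtml rewrite itself, and the sample strings are
made up), the two patterns agree on simple one-line tags but differ when a
tag contains a newline, because "." does not match a newline by default:

```python
import re

NON_GREEDY = re.compile(r"<.*?>")    # non-greedy form
NEGATED = re.compile(r"<[^>]*>")     # negated character class form

# On simple one-line tags the two patterns strip the same text.
sample = "<b>bold</b> and <i>italic</i>"
simple_non_greedy = NON_GREEDY.sub("", sample)
simple_negated = NEGATED.sub("", sample)

# A tag split across lines: "." stops at the newline unless re.DOTALL is
# given, while "[^>]" matches a newline like any other character.
multiline = "<a\nhref='x'>link</a>"
multi_non_greedy = NON_GREEDY.sub("", multiline)
multi_negated = NEGATED.sub("", multiline)

print(repr(simple_non_greedy), repr(simple_negated))
print(repr(multi_non_greedy), repr(multi_negated))
```

Neither pattern copes with a ">" inside a quoted attribute value, so both
remain approximations of real HTML tag syntax.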
--
Richie Hindle
richie@entrian.com