HTML Parser
Jan Dries
jdries at mail.com
Sat Dec 30 22:28:42 EST 2000
> "Voitenko, Denis" wrote:
>
> I am trying to write an HTML parser. I am starting off with a simple
> one like so:
>
[snip]
> HTMLtags=re.compile('<.*>')
This is presumably the root of your problem. The * quantifier is greedy,
so '<.*>' matches the longest string that starts with a < and ends with
a >, including any < and > in between. In your example, HTMLtags
therefore matches the entire string "<a href=hello.jsp>Hello</a>" rather
than the two tags separately, because that is the longest part of your
input that starts with < and ends with >.
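A minimal sketch of that greedy behaviour, using the string from the example above. The variable names (text, greedy, greedy_matches) are only illustrative and are not taken from the original script:

```python
import re

# Sample input from the example discussed above.
text = "<a href=hello.jsp>Hello</a>"

# ".*" is greedy: it consumes as much as it can, so the match runs
# from the first "<" to the last ">" on the line.
greedy = re.compile(r'<.*>')

greedy_matches = greedy.findall(text)
print(greedy_matches)  # ['<a href=hello.jsp>Hello</a>']
```

A single match comes back, covering both tags and the text between them.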
Try the following instead:
HTMLtags = re.compile(r'<[^>]*>')
It matches strings that start with <, end with >, and have no > in
between. Depending on the HTML you wish to parse, even this might not
be enough, because a quoted attribute value may itself contain a ">",
as in <img alt="a > b">, and the match would then stop too early.
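As a rough illustration of both points, the sketch below applies the negated-class pattern to the example string and then to a made-up tag whose attribute value contains a ">". The names tag, simple, and tricky, and the img example itself, are assumptions for demonstration only:

```python
import re

# Negated character class: stop at the first ">" after each "<".
tag = re.compile(r'<[^>]*>')

# The example string now yields each tag separately.
simple = "<a href=hello.jsp>Hello</a>"
simple_tags = tag.findall(simple)
print(simple_tags)  # ['<a href=hello.jsp>', '</a>']

# Hypothetical input: the quoted attribute value contains a ">",
# so the pattern cuts the tag off in the middle of the attribute.
tricky = '<img alt="a > b" src=x.gif>'
tricky_tags = tag.findall(tricky)
print(tricky_tags)  # ['<img alt="a >']
```

The second result shows why a pure regular-expression approach can break on real-world HTML: the pattern has no notion of quoted attribute values.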
Regards,
Jan