HTML Parser

Jan Dries jdries at mail.com
Sat Dec 30 22:28:42 EST 2000


> "Voitenko, Denis" wrote:
> 
> I am trying to write an HTML parser. I am starting off with a simple
> one like so:
> 
[snip]
> HTMLtags=re.compile('<.*>')

This is presumably the root of your problem. The * quantifier is greedy,
so the expression matches the longest string that starts with a < and
ends with a >, including strings with < and > in the middle. So in your
example, HTMLtags matches the entire string
"<a href=hello.jsp>Hello</a>", because that is the longest part of your
input that starts with < and ends with >.
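
To see the greedy behaviour for yourself, here is a minimal sketch. It
assumes the input is just the string quoted above (your real input was
snipped, so it may differ), and the names text and greedy_matches are
only for illustration:

```python
import re

# Assumed sample input: the string quoted in the explanation above.
text = "<a href=hello.jsp>Hello</a>"

# The original pattern: '.' matches any character except a newline, and
# '*' takes as many characters as it can while still allowing a final '>'.
HTMLtags = re.compile('<.*>')

greedy_matches = HTMLtags.findall(text)
print(greedy_matches)   # ['<a href=hello.jsp>Hello</a>']
```

A single match comes back, spanning both tags and the text between
them, rather than one match per tag.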

Try the following instead:

    HTMLtags=re.compile('<[^>]*>')

It will match strings that start with <, end with > and have no > in the
middle. Depending on the HTML you wish to parse, even this might not be
enough, though, because the value of an HTML attribute may itself
contain a ">".
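
A short sketch of both points, again assuming the sample string from
above. The img tag is a made-up example (not from your input) of an
attribute value that contains a ">":

```python
import re

HTMLtags = re.compile('<[^>]*>')

# Assumed sample input: the string discussed earlier in this reply.
text = "<a href=hello.jsp>Hello</a>"
tags = HTMLtags.findall(text)
print(tags)             # ['<a href=hello.jsp>', '</a>']

# Hypothetical tag whose quoted attribute value contains a '>'.
tricky = '<img alt="a > b" src=x.gif>'
tricky_tags = HTMLtags.findall(tricky)
print(tricky_tags)      # ['<img alt="a >']
```

In the second case the match stops at the ">" inside the alt value, so
the tag is cut short. That is the situation where a plain regular
expression stops being enough.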

Regards,
Jan



