Parsing an HTML a tag

Sat Sep 24 14:34:20 EDT 2005

"beza1e1" <andreas.zwinkau at googlemail.com> writes:

> I do not really know what you want to do. Getting the URLs from the a
> tags of an HTML file? I think the easiest method would be a regular
> expression.

I think this ranks as #2 on the list of "difficult one-day
hacks". Yeah, it's simple to write an RE that works most of the
time. It's a major PITA to write one that works in all the legal
cases. Getting one that also handles all the cases seen in the wild is
damn near impossible.

> >>> import urllib, sre
> >>> html = urllib.urlopen("http://www.google.com").read()
> >>> sre.findall('href="([^>]+)"', html)

This fails in a number of cases: whitespace around the "=" sign for
attributes; quotes around other attributes in the tag (required by
XHTML), since the greedy [^>]+ then runs on to the last quote in the
tag; '>' in the URL (legal, but discouraged); attributes quoted with
single quotes instead of double quotes, or just unquoted. It also
misses IMG SRC attributes. And it hands back relative URLs as such,
instead of resolving them to absolute URLs (which requires checking
for a BASE element in the HEAD), which may or may not be acceptable.
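
To make those failures concrete, here is a small sketch that runs the
same pattern over one hand-made tag per case. The sample tags and the
helper name hrefs are made up for illustration, and it uses the re
module rather than sre (the old name for the same regex engine), so
treat it as a demonstration, not as the code from the quoted post.

```python
import re

# Same pattern as in the quoted one-liner.
PATTERN = re.compile(r'href="([^>]+)"')


def hrefs(tag):
    """Return whatever the naive pattern extracts from one tag."""
    return PATTERN.findall(tag)


# One hypothetical tag per failure case described above.
samples = {
    "plain double-quoted href": '<a href="a.html">',
    "whitespace around '='": '<a href = "a.html">',
    "single quotes": "<a href='a.html'>",
    "unquoted value": "<a href=a.html>",
    "another quoted attribute": '<a href="a.html" title="A">',
    "'>' inside the URL": '<a href="a.html?x>1">',
    "IMG SRC": '<img src="a.png">',
}

for label, tag in samples.items():
    print("%-26s %r" % (label, hrefs(tag)))
```

Only the first sample comes back as ['a.html']. The tag with a second
quoted attribute returns the URL with the other attribute glued on,
and the rest return nothing at all.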

> Google has some strange html, href without quotation marks: <a
> href=http://www.google.com/ncr>Google.com in English</a>

That's not strange. That's just a bit unusual. Perfectly legal,
though: any browser (or other HTML processor) that fails to handle it
is broken.
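
For comparison, a minimal sketch of handing the job to a real HTML
parser instead of an RE. It assumes the html.parser and urllib.parse
modules from current Python's standard library, which is my choice of
tool here, not something from the quoted post. The names LinkExtractor
and extract_links are hypothetical. It picks up A HREF and IMG SRC,
copes with unquoted and single-quoted values, and resolves relative
URLs against the first BASE HREF if there is one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect A HREF and IMG SRC values (hypothetical helper class)."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url    # replaced by the first <base href=...>
        self.base_seen = False
        self.raw = []           # URLs exactly as written in the page

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "base" and not self.base_seen and attrs.get("href"):
            self.base = urljoin(self.base, attrs["href"])
            self.base_seen = True
        elif tag == "a" and attrs.get("href"):
            self.raw.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.raw.append(attrs["src"])

    def links(self):
        # Resolve at the end, so the base URL found in HEAD is applied.
        return [urljoin(self.base, url) for url in self.raw]


def extract_links(html, page_url):
    """Return absolute URLs for the links and images found in html."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    return parser.links()


if __name__ == "__main__":
    page = "<a href=http://www.google.com/ncr>Google.com in English</a>"
    print(extract_links(page, "http://www.google.com/"))
```

This still leaves plenty out (FRAME, LINK, AREA, image maps, and so
on), so take it as a starting point, not a complete link extractor.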

        <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.