Parsing an HTML a tag

Mike Meyer mwm at mired.org
Sat Sep 24 15:47:36 EDT 2005


"beza1e1" <andreas.zwinkau at googlemail.com> writes:

> I think for a quick hack, this is as good as a parser. A simple parser
> would miss some cases as well. REs are hardly extensible though, so
> your criticism is valid.

Pretty much any first attempt is going to miss some cases. There are
libraries available that have stood the test of time. Simply using
one of those is the right solution.

> The point is what George wants to do. A mixture would be possible as
> well:
> Getting all <a ...> tags with an RE and then extracting the URL with
> something like a parser.

I thought the point was to extract all URLs? Those also appear in
attributes of tags other than A tags. Handling that means configuring
the parser properly, but it's *much* simpler with a parser that
understands the structure of HTML, where you should be able to specify
the tag/attribute pairs to look for, than with something that treats
the page as unstructured text.
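
A minimal sketch of the tag/attribute-pair idea, assuming the
html.parser module from the Python standard library. The URL_ATTRS
table and the extract_urls helper are hypothetical names used only for
illustration, and the table is not a complete list of URL-bearing
attributes:

```python
from html.parser import HTMLParser

# Hypothetical, incomplete table of (tag, attribute) pairs that may hold URLs.
URL_ATTRS = {
    ("a", "href"),
    ("img", "src"),
    ("link", "href"),
    ("script", "src"),
    ("form", "action"),
}


class URLExtractor(HTMLParser):
    """Collect attribute values for the configured tag/attribute pairs."""

    def __init__(self, pairs=URL_ATTRS):
        super().__init__()
        self.pairs = set(pairs)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        for name, value in attrs:
            # value is None for attributes written without a value.
            if value is not None and (tag, name) in self.pairs:
                self.urls.append(value)


def extract_urls(html, pairs=URL_ATTRS):
    """Return the URLs found in html, in document order."""
    parser = URLExtractor(pairs)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Adding another kind of URL is then a matter of adding a pair to the
table, for example ("frame", "src"), rather than writing another
regular expression.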

         <mike

-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


