how to get text between HTML tags with URLLIB??

Alex Martelli alex at magenta.com
Sat Aug 19 04:15:38 EDT 2000


"Roy Katz" <katz at Glue.umd.edu> wrote in message
news:Pine.GSO.4.21.0008182244560.1957-100000 at z.glue.umd.edu...
> There should be a way through urllib, right?
> if urllib can't do it, then I see it as a deficiency in urllib.
> but thanks for the regexp!

urllib's purpose is to "Open an arbitrary resource by URL".

It has nothing to do with the internal structure of the stream
you get back.  I would consider it a horrible wart if a module
meant to open arbitrary resources also had functionality to
parse the internal structure of some (but not all) kinds of
such resources.
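To illustrate that division of labour, here is a minimal sketch in
present-day Python 3, where the opening function lives in
urllib.request (an assumption about your Python version; the module
layout was different in 2000).  The helper name fetch_bytes and the
two sample URLs are made up for illustration.  data: URLs are used
only so the sketch runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_bytes(url):
    """Open any URL that urllib understands and return the raw payload."""
    with urlopen(url) as response:
        return response.read()


# Hypothetical sample resources; http:// or file:// URLs go through
# exactly the same call.
html_url = "data:text/html," + quote("<p>Hello, <b>world</b></p>")
text_url = "data:text/plain," + quote("just some text")

html_bytes = fetch_bytes(html_url)
text_bytes = fetch_bytes(text_url)

# In both cases urllib hands back undifferentiated bytes; it makes no
# attempt to interpret the HTML markup in the first resource.
print(type(html_bytes), html_bytes)
print(type(text_bytes), text_bytes)
```

Whatever the resource is, the result is just bytes; making sense of
them is a job for a separate parser.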


If the resource you open is an HTML stream, you can parse it
with htmllib (a rather low-level approach, not particularly
easy to use) or with other, higher-level (and thus easier
to use) HTML parsers; for example, if you want to follow the
relevant W3C standards for the Document Object Model as applied
to HTML documents, look at 4DOM:
    http://fourthought.com/4Suite/4DOM/

There are many other implementations of HTML parsers for
Python, but I suggest you look into 4DOM -- it looks best to
me.  I suggest NOT using regexes to try to parse HTML, or
you'll be systematically thrown by HTML comments, quoted
strings, and so on; a good parser handles those for you.
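As a rough sketch of the difference, the example below pulls the text
between <b> tags twice: once with a parser and once with a naive
regex.  It uses html.parser from the Python 3 standard library, which
is a substitution on my part (htmllib is no longer shipped with
Python 3, and this is not the 4DOM API).  The names TagTextCollector,
text_between and SAMPLE are hypothetical.

```python
import re
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # nesting level of the tag we are watching
        self.chunks = []    # text pieces of the element being read
        self.results = []   # one string per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.chunks = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))

    def handle_data(self, data):
        # Only keep text that appears inside the watched tag.
        if self.depth > 0:
            self.chunks.append(data)


def text_between(html, tag):
    """Return the text inside every <tag>...</tag> element in html."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up input with a comment and a quoted attribute that both
# contain something that merely looks like a <b> element.
SAMPLE = ('<!-- <b>old draft</b> -->'
          '<p title="<b>not a tag</b>">Some <b>bold</b> text</p>')

parsed = text_between(SAMPLE, "b")
naive = re.findall(r"<b>(.*?)</b>", SAMPLE)

print(parsed)  # ['bold']
print(naive)   # ['old draft', 'not a tag', 'bold']
```

The parser reports only the real element, while the regex also picks
up the look-alikes inside the comment and the quoted attribute value.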


Alex





