HTML parsing examples

Wed Jan 3 05:52:03 EST 2001

Voitenko, Denis wrote:

>Anyways, I am trying to figure out how to parse HTML with htmllib and I am
>completely lost. Can someone give examples?
>Say, I have a file like this
>
><html>
><body>
><p>Hello, World!
></body>
></html>
>
>I would like to find methods and contents of <body> and contents of <p>. How
>would I do that?

You can find examples in previous threads or in the effbot's guide to the
standard library. Just search for sgmllib.

But you will find that extracting the content of the <p> tag is in fact much
harder than it might seem (in the general case). The reason is that the
sgmllib and htmllib modules do not "know" that the <p> tag should be
automatically closed by the </body> tag, so you will have to implement
this logic yourself. It can be quite hard to tell exactly where a particular
<p> tag starts and ends, and the net is full of poorly written web pages,
which makes this parsing even harder.
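
To give a rough idea of what "implementing this logic yourself" might look
like, here is a minimal sketch. It is only an illustration under some
assumptions: it uses html.parser from the current standard library rather
than sgmllib/htmllib (which are no longer shipped with Python 3), the names
ParagraphCollector, extract_paragraphs and CLOSES_P are made up for this
example, and the list of tags that implicitly end a <p> is deliberately
incomplete.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of start tags that implicitly
# end an open <p>. Real-world HTML needs a longer list.
CLOSES_P = {"p", "div", "ul", "ol", "table", "h1", "h2", "h3",
            "blockquote", "pre"}


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p>, even when </p> is missing."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text pieces of the open <p>, or None

    def _close_paragraph(self):
        # Store the open paragraph (if any) and mark it as closed.
        if self._current is not None:
            self.paragraphs.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        # A new block-level start tag implicitly ends the open <p>.
        if tag in CLOSES_P:
            self._close_paragraph()
        if tag == "p":
            self._current = []

    def handle_endtag(self, tag):
        # An explicit </p>, or the end of an enclosing element such as
        # </body>, ends the paragraph.
        if tag in ("p", "body", "html") or tag in CLOSES_P:
            self._close_paragraph()

    def handle_data(self, data):
        # Only keep text that appears inside an open paragraph.
        if self._current is not None:
            self._current.append(data)

    def close(self):
        # End of input also ends any paragraph that is still open.
        super().close()
        self._close_paragraph()


def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    doc = "<html>\n<body>\n<p>Hello, World!\n</body>\n</html>\n"
    print(extract_paragraphs(doc))
```

Even this small version has to decide which tags end a paragraph, and it
still ignores nesting, attributes and most of the ways a real page can be
broken.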

The task of parsing badly formatted web pages is so common and so complicated
that it really deserves its own module.  I've been doing some work on
simplifying HTML parsing and created modules for Python and Perl. You can find
them at:

	http://www.acc.umu.se/~r2d2/files/

Currently the Perl module (HTML::Transform) is the more mature one. It is
able to find the end of <p> tags and other similar tags which are often
used in a sloppy fashion (<li>, <td>, etc.). It also handles badly written
documents reasonably well. Unfortunately, this functionality has not yet
been implemented in the corresponding Python module. I plan to bring the
Python module up to the same level of maturity as the Perl module, but I
haven't had the time or incentive to do it yet.

// Niklas