Parsing HTML

Thomas Guettler guettli at thomas-guettler.de
Fri Jul 9 09:02:26 EDT 2004


On Thu, 08 Jul 2004 17:04:24 +0100, C Gillespie wrote:

> Dear All,
> 
> I have hopefully a very simple problem. I wish to parse an html page and
> extract everything between the <body> tags.
> 
> E.g.
> <head>
>     <body>
>         <b>afsdf</b>
>     </body>
> </head>
> 
> Would give
> <body>
>     <b>afsdf</b>
> </body>
> 
> I've been playing about with htmllib with no success. Any suggestions?

HTML can be broken in many ways. If you want
a solution that can read most of the HTML on the
web, you can run the page through tidy and have it
write XML (XHTML) as output.

Well-formed XML is much easier to handle with SAX or DOM
than raw HTML.
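
A rough sketch of the DOM half, assuming the page has already been
converted to well-formed XHTML (for example with something like
"tidy -asxml page.html > page.xhtml"; check the options of your tidy
version). The function name extract_body and the sample string are
only made up for illustration:

```python
from xml.dom import minidom


def extract_body(xhtml_text):
    """Return the first <body> element of well-formed XHTML as a string.

    Returns None if the document has no <body> element.
    Assumes the input is already well-formed XML (e.g. tidy output);
    raw, broken HTML will make the parser raise an exception.
    """
    doc = minidom.parseString(xhtml_text)
    bodies = doc.getElementsByTagName("body")
    if not bodies:
        return None
    # toxml() serializes the element together with its children.
    return bodies[0].toxml()


# Hypothetical, already-tidied input for illustration.
sample = """<html><head><title>t</title></head>
<body>
    <b>afsdf</b>
</body>
</html>"""

if __name__ == "__main__":
    print(extract_body(sample))
```

If the documents are large, a SAX handler that starts collecting at
the opening body tag would avoid building the whole tree in memory,
but for a single page the DOM version is probably the simplest.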

Regards,
 Thomas

-- 
Thomas Güttler, http://www.thomas-guettler.de/




