Parsing html

Richard Brodie R.Brodie at rl.ac.uk
Fri Jul 9 06:32:43 EDT 2004


> > I have hopefully a very simple problem. I wish to parse an html page and
> > extract everything between the <body> tags.
>
> People are actually suggesting using DOM for this?! A simple approach is
> much better:

"For every complex problem, there is a solution that is simple ... and wrong."
Yes, it will work some of the time. However, it doesn't handle the following
properly (and there are probably other cases):

1. Comments.
2. CDATA sections.
3. White space.
4. Mixed or upper case.
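
To make the list concrete, here is a hypothetical stand-in for the kind of
"simple approach" being discussed (the quoted code isn't reproduced here, so
naive_body and the sample strings are my own assumptions, not the original
poster's code). It is only a sketch of how an exact-match string search trips
over each of the cases above:

```python
def naive_body(html):
    # Hypothetical "simple" approach: look for the literal tags only.
    start = html.find("<body>")
    end = html.find("</body>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<body>"):end]


# Made-up inputs, one per problem case in the list.
samples = {
    "plain": "<html><body>hi</body></html>",
    "comment": "<html><!-- <body>old</body> --><body>hi</body></html>",
    "cdata": "<body><![CDATA[</body>]]>hi</body>",
    "white space": "<html><body >hi</body ></html>",
    "upper case": "<HTML><BODY>hi</BODY></HTML>",
}

for name, html in samples.items():
    print(name, "->", repr(naive_body(html)))
```

Only the "plain" sample comes back as 'hi'. The comment case returns the
commented-out text, the CDATA case is cut short, and the last two find no
match at all.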

The advantage of using a proper parser is that it caters for these sorts of things,
and you only have to get it right once. OTOH, these advantages are largely
negated if you can't be sure your input HTML is valid. What works best for
you depends on what you are using it for.
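
For illustration, a rough sketch of the parser route, assuming a current
Python 3 where the standard library parser lives in html.parser (BodyExtractor
and extract_body are names I made up). It copes with comments, white space
inside the tags and upper case; it makes no attempt at CDATA sections, and it
drops comments found inside the body rather than copying them through:

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collects the markup found between <body> and </body>."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names, so <BODY > matches too.
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: copy the source text once.
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append("&#%s;" % name)


def extract_body(html):
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


doc = "<HTML><!-- <body>old</body> --><BODY >Hello <b>world</b> &amp; co<br/></BODY></HTML>"
print(extract_body(doc))
```

Note that the end tags are rebuilt from the lower-cased tag name, so the
output is equivalent markup rather than a byte-for-byte copy of the input.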





More information about the Python-list mailing list