HTML parsing confusion

elijahu at gmail.com elijahu at gmail.com
Tue Jan 22 22:19:15 EST 2008


On Jan 22, 7:29 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:

>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser
> module in the standard Python library. Or even the parser in the htmllib
> module. But a lot of HTML pages out there are invalid, some are grossly
> invalid, and those parsers are just unable to handle them. This is why
> modules like BeautifulSoup exist: they contain a lot of heuristics and
> trial-and-error and personal experience from the developers, in order to
> guess more or less what the page author intended to write and make some
> sense of that "tag soup".
> Guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the
> documents you are handling now, fine. But don't complain when your REs
> match too much or too little or don't match at all because of unclosed
> tags, improperly nested tags, nonsense markup, or just a valid combination
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thank you. That makes perfect sense, and it is a good, clear summary
of the upsides and downsides of what I'm trying to do, as well as a
good explanation of why BeautifulSoup will probably remain outside the
std lib. I'm sure that I will get plenty of use out of it.

If, however, I am sure that the html code in the target documents is
good, and the framework html doesn't change, just the data on page
after page of static html, would it be better to just go with regex or
with one of the std lib items you mentioned? I thought the latter, but
I'm stuck on how to make them generate results similar to the code I
put above as an example. I'm not trying to code this to go against
html in the wild, but to try to strip specific, consistently located
data from the markup and turn it into something more useful.
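
For what it's worth, here is a minimal sketch of what the std lib route
might look like for this kind of fixed-layout extraction. It uses the
Python 3 spelling, html.parser.HTMLParser (in Python 2 the same class
lived in the HTMLParser module). The markup, the "data" class name, and
the CellExtractor and extract_cells names are all made up for
illustration, and the sketch assumes the cells of interest are never
nested inside one another.

```python
from html.parser import HTMLParser  # Python 2 equivalent: from HTMLParser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> whose class matches target_class.

    The class name "data" is a hypothetical convention, not something
    taken from any real page.
    """

    def __init__(self, target_class="data"):
        super().__init__()
        self.target_class = target_class
        self.in_target = False   # True while inside a matching <td>
        self.current = []        # text fragments of the cell being read
        self.cells = []          # finished cell texts, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") == self.target_class:
            self.in_target = True
            self.current = []

    def handle_data(self, data):
        # Text inside child tags such as <b> is delivered here as well
        if self.in_target:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_target:
            self.cells.append("".join(self.current).strip())
            self.in_target = False


def extract_cells(html_text, target_class="data"):
    """Return the stripped text of each matching cell."""
    parser = CellExtractor(target_class)
    parser.feed(html_text)
    parser.close()
    return parser.cells


SAMPLE = """
<table>
  <tr><td class="label">Name</td><td class="data">Widget</td></tr>
  <tr><td class="label">Price</td><td class="data">9.99 <b>USD</b></td></tr>
</table>
"""

print(extract_cells(SAMPLE))  # ['Widget', '9.99 USD']
```

The parser only tracks one bit of state (inside a wanted cell or not),
so it keeps working when attribute order or whitespace changes between
pages, which is where a hand-written regex tends to go wrong first.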

I may have confused folks by using the www.diveintopython.org page as
an example, but its html seemed to be valid, strict markup.


