HTML parsing confusion

Alnilam alnilam at gmail.com
Tue Jan 22 22:41:20 EST 2008


On Jan 22, 7:29 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser  
> module in the standard Python library. Or even the parser in the htmllib  
> module. But a lot of HTML pages out there are invalid, some are grossly  
> invalid, and those parsers are just unable to handle them. This is why  
> modules like BeautifulSoup exist: they contain a lot of heuristics and  
> trial-and-error and personal experience from the developers, in order to  
> guess more or less what the page author intended to write and make some  
> sense of that "tag soup".
> Guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation  
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the  
> documents you are handling now, fine. But don't complain when your RE's  
> match too much or too little or don't match at all because of unclosed  
> tags, improperly nested tags, nonsense markup, or just a valid combination  
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thanks, Gabriel. That does make sense, both what the benefits of
BeautifulSoup are and why it probably won't become std lib anytime
soon.

The pages I'm trying to write this code to run against aren't in the
wild, though. They are static html files on my company's lan, are very
consistent in format, and are (I believe) valid html. They just have
specific paragraphs of useful information, located in the same place
in each file, that I want to 'harvest' and put to better use. I used
diveintopython.org as an example only (and in part because it had good
clean html formatting). I am pretty sure that I could craft some
regular expressions to do the work -- which of course would not be the
case if I were screen scraping web pages in the 'wild' -- but I was
trying to find a way to do that using one of those std libs you
mentioned.
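For concreteness, the kind of regex I have in mind is something like this sketch -- the page snippet and the id="summary" marker are made-up stand-ins for our actual files:

```python
import re

# Hypothetical page: the paragraph tagged id="summary" holds the text we want.
page = '<html><body><p id="summary">Useful <b>info</b> here.</p></body></html>'

# Non-greedy match between the opening and closing tags. This is fragile:
# it breaks if attributes are reordered, paragraphs nest, or a tag is unclosed.
m = re.search(r'<p id="summary">(.*?)</p>', page, re.DOTALL)
text = re.sub(r'<[^>]+>', '', m.group(1)).strip() if m else None
print(text)  # Useful info here.
```

That works on our consistent files, but as you say, it would fall over quickly in the wild.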

I'm not sure if HTMLParser or htmllib would work better to achieve the
same effect as the regex example I gave above, or how to get them to
do that. I thought I'd come close, but as someone pointed out early
on, I'd accidentally tapped into PyXML, which is installed where I was
testing code, but not necessarily where I need it. It may turn out
that the regex way works faster, but falling back on methods I'm
comfortable with doesn't help expand my Python knowledge.

So if anyone can tell me how to get HTMLParser or htmllib to grab a
specific paragraph, and then provide the text in that paragraph in a
clean, markup-free format, I'd appreciate it.
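My best guess so far is something along these lines -- a small HTMLParser subclass that flips a flag on the target paragraph and collects the character data, which comes out markup-free. (The snippet and the id="summary" attribute are invented for illustration; note that in Python 3 the class moved to the html.parser module, and htmllib was removed entirely.) I'm not sure it's idiomatic, though:

```python
from html.parser import HTMLParser  # in Python 2: from HTMLParser import HTMLParser

class ParagraphGrabber(HTMLParser):
    """Collect the text of the <p> whose id attribute matches target_id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # are we between the target <p> and its </p>?
        self.chunks = []      # character data collected while inside

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("id", self.target_id) in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        # Inline tags like <b> don't change state, so their text is kept
        # while the markup itself is dropped.
        if self.inside:
            self.chunks.append(data)

    def text(self):
        return "".join(self.chunks).strip()

page = '<html><body><p id="summary">Useful <b>info</b> here.</p></body></html>'
grabber = ParagraphGrabber("summary")
grabber.feed(page)
print(grabber.text())  # Useful info here.
```

If the paragraphs aren't marked with ids, the same idea works by counting <p> start tags and grabbing the Nth one.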
