HTML parsing confusion

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Tue Jan 22 19:29:56 EST 2008


On Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <alnilam at gmail.com> wrote:

> On Jan 22, 11:39 am, "Diez B. Roggisch" <de... at nospam.web.de> wrote:
>> Alnilam wrote:
>> > On Jan 22, 8:44 am, Alnilam <alni... at gmail.com> wrote:
>> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> >> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
>> >> > 200-modules PyXML package installed. And you don't want the 75Kb
>> >> > BeautifulSoup?
>> > Ugh. Found it. Sorry about that, but I still don't understand why
>> > there isn't a simple way to do this without using PyXML, BeautifulSoup
>> > or libxml2dom. What's the point in having sgmllib, htmllib,
>> > HTMLParser, and formatter all built in if I have to use someone
>> > else's modules to write a couple of lines of code that achieve the
>> > simple thing I want. I get the feeling that this would be easier if I
>> > just broke down and wrote a couple of regular expressions, but it
>> > hardly seems a 'pythonic' way of going about things.
>>
>> This is simply a gross misunderstanding of what BeautifulSoup or lxml
>> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
>> sense is by no means trivial. And just because you can come up with a few
>> lines of code using rexes that work for your current use-case doesn't mean
>> that they serve as a general html-fixing routine. Or do you think the
>> rather long history and 75Kb of code for BS are because its creator wasn't
>> aware of rexes?
>
> I am, by no means, trying to trivialize the work that goes into
> creating the numerous modules out there. However as a relatively
> novice programmer trying to figure out something, the fact that these
> modules are pushed on people with such zealous devotion that you take
> offense at my desire to not use them gives me a bit of pause. I use
> non-included modules for tasks that require them, when the capability
> to do something clearly can't be done easily another way (eg.
> MySQLdb). I am sure that there will be plenty of times where I will
> use BeautifulSoup. In this instance, however, I was trying to solve a
> specific problem which I attempted to lay out clearly from the
> outset.
>
> I was asking this community if there was a simple way to use only the
> tools included with Python to parse a bit of html.

If you *know* that your document is valid HTML, you can use the HTMLParser  
module in the standard Python library. Or even the parser in the htmllib  
module. But a lot of HTML pages out there are invalid, some are grossly  
invalid, and those parsers are just unable to handle them. This is why  
modules like BeautifulSoup exist: they contain a lot of heuristics and  
trial-and-error and personal experience from the developers, in order to  
guess more or less what the page author intended to write and make some  
sense of that "tag soup".
Guesswork like that is not suitable for the std lib ("Errors should  
never pass silently" and "In the face of ambiguity, refuse the temptation  
to guess.") but it makes a perfect 3rd party module.

If you want to use regular expressions, and that works OK for the  
documents you are handling now, fine. But don't complain when your REs  
match too much or too little or don't match at all because of unclosed  
tags, improperly nested tags, nonsense markup, or just a valid combination  
that you didn't take into account.
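Both failure modes are easy to demonstrate. A sketch (the sample markup is invented for illustration): a greedy pattern overshoots on perfectly valid input, and even the non-greedy version silently captures the wrong span once a tag goes unclosed:

```python
import re

html = '<div class="a">first</div><div class="b">second</div>'

# Greedy .* runs to the LAST </div>, capturing far too much.
greedy = re.search(r'<div class="a">(.*)</div>', html)
print(greedy.group(1))  # first</div><div class="b">second

# Non-greedy .*? behaves on this input...
lazy = re.search(r'<div class="a">(.*?)</div>', html)
print(lazy.group(1))    # first

# ...but if the first div is never closed, it quietly matches up to the
# closing tag of a *different* element -- no error, just wrong output.
broken = '<div class="a">unclosed <div class="b">second</div>'
m = re.search(r'<div class="a">(.*?)</div>', broken)
print(m.group(1))  # unclosed <div class="b">second
```

None of these raise an exception; the mismatch only shows up downstream, which is exactly why a real parser is worth the dependency once the input stops being under your control.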

-- 
Gabriel Genellina



