BeautifulSoup vs. real-world HTML comments

Robert Kern robert.kern at gmail.com
Wed Apr 4 16:55:21 EDT 2007


Carl Banks wrote:
> On Apr 4, 2:43 pm, Robert Kern <robert.k... at gmail.com> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:
>>>> BeautifulSoup can't parse this page usefully at all.
>>>> It treats the entire page as a text chunk.  It's actually
>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>> level problem.
>>> Google for a program called "tidy".  Install it, and run it as a
>>> filter on any HTML you download.  "tidy" has invested in it quite a
>>> bit of work understanding common bad HTML and how browsers deal with
>>> it.  It would be pointless to duplicate that work in the Python
>>> standard library; let HTMLParser be small and tight, and outsource the
>>> handling of floozy input to a dedicated program.
>> Well, BeautifulSoup is just such a dedicated library.
> 
> No, not really.

Yes, it is. Whether it succeeds in all particulars is beside the point. The
only mission of BeautifulSoup is to handle bad HTML. That tidy doesn't
successfully handle some other subset of bad HTML doesn't mean it's not a
dedicated program for handling bad HTML.

>> However, it defers its
>> handling of comments to HTMLParser. That's the problem.
> 
> Well, it's up to the writers of Beautiful Soup to decide how much bad
> HTML they want to accept.  ISTM they're happy to live with the
> limitations of HTMLParser, meaning that they do not consider Beautiful
> Soup to be a library dedicated to reading every piece of bad HTML out
> there.

Sorry, let me be clearer: The problem is that they haven't overridden
SGMLParser's handling of comments (not HTMLParser, sorry) the way they have
overridden many other parts of SGMLParser. Yes, any fix should go into
BeautifulSoup and not SGMLParser.

All it takes is someone to code up their desired behavior for these perverse
comments and submit it to Leonard Richardson.
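
For illustration, overriding comment handling in a parser subclass is not much
code. What follows is only a minimal sketch, assuming Python 3's
html.parser.HTMLParser as a stand-in for SGMLParser. The class name and the
behavior it implements (collecting comments separately from text) are
hypothetical; this is not BeautifulSoup's actual code, and it does not attempt
to define what should happen for malformed comments.

```python
from html.parser import HTMLParser


class CommentCollectingParser(HTMLParser):
    """Hypothetical subclass that overrides the base class's comment hook."""

    def __init__(self):
        super().__init__()
        self.comments = []  # comment bodies, in document order
        self.text = []      # non-blank text chunks, in document order

    def handle_comment(self, data):
        # The base class calls this hook for each comment it recognizes.
        # A real fix would put its preferred handling here.
        self.comments.append(data.strip())

    def handle_data(self, data):
        # Keep ordinary text apart from comments, skipping pure whitespace.
        if data.strip():
            self.text.append(data.strip())


parser = CommentCollectingParser()
parser.feed("<p>before<!-- a well-formed comment -->after</p>")
parser.close()
```

After the feed, parser.comments holds the comment body and parser.text holds
the surrounding text. Deciding what to do with comments the base parser does
not recognize at all is the part that still needs someone's design.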

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



