BeautifulSoup vs. real-world HTML comments

Robert Kern robert.kern at gmail.com
Wed Apr 4 14:43:37 EDT 2007


Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:

>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk.  It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
> 
> Google for a program called "tidy".  Install it, and run it as a
> filter on any HTML you download.  "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it.  It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.
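
To illustrate where comment handling lives, here is a minimal sketch using the
standard library parser on its own, with no BeautifulSoup involved. It assumes
Python 3, where the module is named html.parser (in 2007 it was HTMLParser).
The CommentCollector class and collect() helper are hypothetical names made up
for this example. Comments reach user code only through the parser's
handle_comment() callback, so whatever the parser decides is or is not a
comment is what any library built on top of it gets to see.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records the comments and start tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for every start tag the parser recognizes.
        self.tags.append(tag)

    def handle_comment(self, data):
        # Called with the text between "<!--" and "-->".
        self.comments.append(data)


def collect(html):
    """Feed a string to the parser and return (comments, start_tags)."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments, parser.tags


# A well-formed comment is reported through handle_comment().
comments, tags = collect("<p>before</p><!-- a well-formed comment --><p>after</p>")
print(comments)
print(tags)
```

Feeding collect() a page with malformed comments shows what the parser makes of
them; the result may differ between Python versions, so I have not listed an
expected output for that case.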

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



