BeautifulSoup vs. real-world HTML comments

Carl Banks pavlovevidence at gmail.com
Wed Apr 4 14:28:42 EDT 2007


On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:
>     The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands.  I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers.  Here's yet another example, this one from
> "http://www.webdirectory.com".  The page starts like this:
>
>         <!Hello there! Welcome to The Environment Directory!>
>         <!Not too much exciting HTML code here but it does the job! >
>         <!See ya, - JD >
>
>         <HTML><HEAD>
>         <TITLE>Environment Web Directory</TITLE>
>
> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> without problems.
>
> BeautifulSoup can't parse this page usefully at all.
> It treats the entire page as a text chunk.  It's actually
> HTMLParser that parses comments, so this is really an HTMLParser
> level problem.

Google for a program called "tidy".  Install it, and run it as a
filter on any HTML you download.  A lot of work has gone into "tidy"
to understand common bad HTML and how browsers deal with it.  It
would be pointless to duplicate that work in the Python standard
library; let HTMLParser stay small and tight, and outsource the
handling of sloppy input to a dedicated program.
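
Something like the following is a rough sketch of what "run it as a
filter" could look like from Python.  It assumes a "tidy" executable is
on your PATH; the function name tidy_filter and the particular flags
chosen here are my own illustration, not anything BeautifulSoup or
HTMLParser provides, so check them against your installed version.

```python
import subprocess

# Hypothetical default command line: quiet mode, UTF-8 in and out,
# no warning chatter, and emit output even when tidy reports errors.
DEFAULT_TIDY = ("tidy", "-q", "-utf8",
                "--show-warnings", "no",
                "--force-output", "yes")


def tidy_filter(html, command=DEFAULT_TIDY):
    """Pipe an HTML string through an external cleanup program.

    Returns whatever the program writes to stdout, decoded as UTF-8.
    """
    result = subprocess.run(
        list(command),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,
    )
    # tidy conventionally exits with 1 for warnings and 2 for errors,
    # so only treat higher codes as a real failure of the filter.
    if result.returncode > 2:
        raise RuntimeError(
            "filter failed with exit code %d: %s"
            % (result.returncode, result.stderr.decode("utf-8", "replace"))
        )
    return result.stdout.decode("utf-8", errors="replace")
```

You would then hand the return value of tidy_filter(page_source) to
BeautifulSoup instead of the raw download.  Whether the cleaned output
is good enough for a given page is something to verify case by case.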


Carl Banks



