BeautifulSoup vs. real-world HTML comments

Paul Boddie paul at boddie.org.uk
Wed Apr 4 16:46:47 EDT 2007


John Nagle wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands.  I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers.  Here's yet another example, this one from
> "http://www.webdirectory.com".  The page starts like this:
>
>
>  <!Hello there! Welcome to The Environment Directory!>
>  <!Not too much exciting HTML code here but it does the job! >
>  <!See ya, - JD >

Anything based on libxml2 and its HTML parser will handle such broken
HTML just fine, even if it simply ignores these erroneous attempts at
comments and discards them as the plain nonsense they clearly are.
Certainly, libxml2dom seems to deal with the page:

import libxml2dom
d = libxml2dom.parseURI("http://www.webdirectory.com", html=1,
                        htmlencoding="iso-8859-1")

I guess lxml and the original libxml2 bindings work at least as well.
Note that some browsers won't be as happy if you give them such
content as XHTML.
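
For anyone without libxml2 to hand, the same contrast can be sketched
with nothing but the Python 3 standard library. This is not libxml2dom
or lxml: it swaps in html.parser as the lenient HTML parser and
xml.etree.ElementTree as a stand-in for strict XML/XHTML handling. The
names CommentCollector, collect and is_well_formed are made up for the
illustration, and the <p> element is a hypothetical placeholder for the
rest of the page. As far as I know, html.parser reports a stray <!...>
construct through handle_comment, whereas an XML parser refuses it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The three malformed "comments" quoted above. The <p> element is a
# hypothetical stand-in for the rest of the page, not the real content.
SNIPPET = (
    "<!Hello there! Welcome to The Environment Directory!>\n"
    "<!Not too much exciting HTML code here but it does the job! >\n"
    "<!See ya, - JD >\n"
    "<p>hypothetical page content</p>\n"
)


class CommentCollector(HTMLParser):
    """Record whatever the lenient HTML parser reports as comments and tags."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.tags = []

    def handle_comment(self, data):
        # Called for proper <!-- ... --> comments and for bogus <!...> ones.
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect(markup):
    """Feed markup to the lenient parser; return (comments, start tags)."""
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments, parser.tags


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup as given."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    comments, tags = collect(SNIPPET)
    print("comments seen by html.parser:", comments)
    print("start tags seen by html.parser:", tags)
    print("accepted as XML:", is_well_formed(SNIPPET))
```

The lenient parser carries on past the broken declarations and still
reaches the element after them, while the strict one gives up at the
first of them, which is roughly the HTML versus XHTML difference noted
above.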

Paul



