BeautifulSoup vs. real-world HTML comments

John Nagle nagle at animats.com
Wed Apr 4 14:08:27 EDT 2007


    The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands.  I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers.  Here's yet another example, this one from
"http://www.webdirectory.com".  The page starts like this:


	<!Hello there! Welcome to The Environment Directory!>
	<!Not too much exciting HTML code here but it does the job! >
	<!See ya, - JD >

	<HTML><HEAD>
	<TITLE>Environment Web Directory</TITLE>


Those are, of course, invalid HTML comments: a valid comment opens with
"<!--" and closes with "-->".  But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a single text chunk.  It's actually
HTMLParser that parses comments, so this is really an
HTMLParser-level problem.
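
One possible workaround, offered only as a rough sketch, is to rewrite the
"<!...>" constructs into well-formed comments before the page ever reaches
the parser.  The function name normalize_bogus_comments and the regular
expression below are illustrative assumptions, not part of BeautifulSoup or
HTMLParser, and the pattern assumes that the bogus comment body contains no
">" character.

```python
import re

# Matches "<!...>" constructs that are not real comments ("<!--"),
# DOCTYPE declarations, or marked sections ("<![").
BOGUS_COMMENT = re.compile(r"<!(?!--|doctype|\[)([^>]*)>", re.IGNORECASE)


def normalize_bogus_comments(html):
    """Rewrite '<!text>' as '<!--text-->' (hypothetical helper)."""
    def fix(match):
        # "--" should not appear inside a comment body, so break up
        # any run of dashes with spaces.
        body = re.sub(r"-(?=-)", "- ", match.group(1))
        # Avoid producing "--->" when the body ends with a dash.
        if body.endswith("-"):
            body += " "
        return "<!--" + body + "-->"
    return BOGUS_COMMENT.sub(fix, html)


page = (
    "<!Hello there! Welcome to The Environment Directory!>\n"
    "<!See ya, - JD >\n"
    "<HTML><HEAD>\n"
    "<TITLE>Environment Web Directory</TITLE>\n"
)
cleaned = normalize_bogus_comments(page)
```

The cleaned string can then be passed to the parser in place of the raw
page.  Real comments and DOCTYPE declarations are left untouched.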


				John Nagle


