BeautifulSoup vs. real-world HTML comments
Steve Holden
steve at holdenweb.com
Wed Apr 4 14:48:32 EDT 2007
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:
>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.
>
>
eGenix have produced the mxTidy library, which wraps tidy's cleanup
features in a form that is easy for Python programmers to use.
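
If installing tidy is not an option, the particular breakage John describes
can also be patched up before the parser ever sees it. The following is only
a rough sketch, not anything BeautifulSoup or HTMLParser provides: the
function name normalize_bogus_comments and the regular expression are my own
assumptions, and the approach simply rewrites "<!...>" runs that are neither
real comments, DOCTYPE declarations, nor marked sections into well-formed
comments.

```python
import re

# Hypothetical helper, not part of BeautifulSoup or HTMLParser.
# Matches "<!" that is NOT followed by "--", "DOCTYPE" or "[",
# and captures everything up to the next ">".
_BOGUS_COMMENT = re.compile(r"<!(?!--|doctype|\[)([^>]*)>", re.IGNORECASE)


def normalize_bogus_comments(html):
    """Rewrite browser-tolerated '<!text>' runs as '<!-- text -->'."""

    def _fix(match):
        # Collapse runs of hyphens so the body cannot contain "--",
        # which would end the rewritten comment early.
        body = re.sub(r"-{2,}", "-", match.group(1)).strip()
        return "<!-- " + body + " -->"

    return _BOGUS_COMMENT.sub(_fix, html)


if __name__ == "__main__":
    page = (
        "<!Hello there! Welcome to The Environment Directory!>\n"
        "<!See ya, - JD >\n"
        "<HTML><HEAD>\n"
        "<TITLE>Environment Web Directory</TITLE>\n"
    )
    print(normalize_bogus_comments(page))
```

The cleaned string can then be handed to BeautifulSoup as usual. A pattern
like this will misfire on "<!" sequences inside scripts or attribute values,
so tidy or mxTidy remains the more robust choice for arbitrary pages.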
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com