BeautifulSoup vs. real-world HTML comments - possible fix

John Nagle nagle at animats.com
Sun May 13 23:03:24 EDT 2007


Robert Kern wrote:
> Carl Banks wrote:
> 
>>On Apr 4, 4:55 pm, Robert Kern <robert.k... at gmail.com> wrote:
>>
>>>Carl Banks wrote:
>>>
>>>>On Apr 4, 2:43 pm, Robert Kern <robert.k... at gmail.com> wrote:
>>>>
>>>>>Carl Banks wrote:
>>>>>
>>>>>>On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:
>>>>>>
>>>>>>>BeautifulSoup can't parse this page usefully at all.
>>>>>>>It treats the entire page as a text chunk.  It's actually
>>>>>>>HTMLParser that parses comments, so this is really an HTMLParser
>>>>>>>level problem.

>>I think the authors of BeautifulSoup have the right to decide what
>>their own mission is.
> 
> 
> Yes, and he's stated it pretty clearly:
> 
> """You didn't write that awful page. You're just trying to get some data out of
> it. Right now, you don't really care what HTML is supposed to look like.
> 
> Neither does this parser."""

   That's a good summary of the issue.  It's a real problem, because
BeautifulSoup's default behavior in the presence of a bad comment is to
silently suck up all remaining text, ignoring HTML markup.
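The failure mode can be illustrated without BeautifulSoup or sgmllib at all; the snippet below (the page string is made up for illustration) mimics what the handler shown next does with plain string operations:

```python
# Hypothetical illustration of the failure mode: once a bad
# declaration is hit, everything from that point to end-of-input is
# swallowed as a single text chunk and never parsed as markup.

page = '<p>before</p><!bogus declaration<p>after</p>'

i = page.find('<!')     # position of the bad declaration
swallowed = page[i:]    # what the current handler treats as raw data

# swallowed is '<!bogus declaration<p>after</p>' -- the <p>after</p>
# markup is lost to the parser.
print(swallowed)
```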

   The problem actually is in BeautifulSoup, in parse_declaration:

     def parse_declaration(self, i):
         """Treat a bogus SGML declaration as raw data. Treat a CDATA
         declaration as a CData object."""
         j = None
         if self.rawdata[i:i+9] == '<![CDATA[':
             k = self.rawdata.find(']]>', i)
             if k == -1:
                 k = len(self.rawdata)
             data = self.rawdata[i+9:k]
             j = k + 3
             self._toStringSubclass(data, CData)
         else:
             try:
                 j = SGMLParser.parse_declaration(self, i)
             except SGMLParseError:
                 toHandle = self.rawdata[i:]
                 self.handle_data(toHandle)
                 j = i + len(toHandle)
         return j

Note what happens when a bad declaration is found.  SGMLParser.parse_declaration
raises SGMLParseError, and the exception handler just sucks up the rest of the
input (note the "rawdata[i:]"), passes it all to handle_data as unparsed text,
and advances the position to the end of input.

That's too brutal.  One bad declaration and the whole parse is messed up.
Something needs to be done at the BeautifulSoup level
to get the parser back on track.  Maybe suck up input until the next ">",
treat that as data, then continue parsing from that point.  That will do
the right thing most of the time, although bad declarations containing
a ">" will still be misparsed.

How about this patch?

             except SGMLParseError:              # bad decl, must recover
                 k = self.rawdata.find('>', i)   # find next ">"
                 if k == -1:                     # if no ">" found,
                     k = len(self.rawdata)       # use entire string
                 else:
                     k += 1                      # step past the ">"
                 toHandle = self.rawdata[i:k]    # take through ">" as data
                 self.handle_data(toHandle)      # treat as data
                 j = i + len(toHandle)           # pick up parsing after ">"

This is untested, but this or something close to it should make
BeautifulSoup much more robust.
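The recovery strategy can be exercised on its own, without sgmllib; here is a standalone sketch (the function name is made up for illustration, and the "step past the '>'" detail is one reasonable reading of the intent):

```python
# Standalone sketch of the skip-to-">" recovery strategy: report the
# bad declaration as plain data, then resume parsing just past it.

def recover_from_bad_declaration(rawdata, i):
    """Return (data, j): data is the bad declaration up to and
    including the next '>', and j is the position at which regular
    parsing should resume."""
    k = rawdata.find('>', i)       # find next ">"
    if k == -1:                    # no ">" anywhere:
        k = len(rawdata)           # consume the rest of the input
    else:
        k += 1                     # step past the ">"
    return rawdata[i:k], k

page = '<p>before</p><!bogus declaration><p>after</p>'
data, j = recover_from_bad_declaration(page, page.find('<!'))
print(data)        # '<!bogus declaration>'
print(page[j:])    # '<p>after</p>' -- parsing can continue here
```

As the post notes, a bad declaration that itself contains a ">" will still be split in the wrong place, but the rest of the page survives.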

It might make sense to catch SGMLParseError in some other
places too, advance past the next ">", and restart parsing there.

					John Nagle
