BeautifulSoup bug when ">>>" found in attribute value

John Nagle nagle at animats.com
Wed Dec 27 13:38:14 EST 2006


Duncan Booth wrote:
> John Nagle <nagle at animats.com> wrote:
> 
> 
>>And this came out, via prettify:
>>
>><addresssnippet siteurl="http%3A//apartmentsapart.com" 
>>url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>>     <param name="movie"
>>     value="/images/offersBanners/sw04.swf?binfot=We offer 
>>fantastic rates for selected weeks or days!!&blinkt=Click here 
>>>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>
>>>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>>
>></param>
>>
>>BeautifulSoup seems to have become confused by the ">>>" within
>>a quoted attribute value.  It first parsed it right, but then stuck
>>in an extra, totally bogus line.  Note the entity "&linkurl;", which
>>appears nowhere in the original.  It looks like code to handle a
>>missing quote mark did the wrong thing.
> 
> 
> I don't think I would quibble with what BeautifulSoup extracted from that 
> mess. The input isn't valid HTML so any output has to be guessing at what 
> was meant. A lot of code for parsing HTML would assume that there was a 
> quote missing and the tag was terminated by the first '>'. IE and Firefox 
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup 
> seems to have given you the best of both worlds: the attribute is parsed to 
> the closing quote, but the tag itself ends at the first '>'.
> 
> As for inserting a semicolon after linkurl, I think you'll find it is just 
> being nice and cleaning up an unterminated entity. Browsers (or at least 
> IE) will often accept entities without the terminating semicolon, so that's 
> a common problem in badly formed HTML that BeautifulSoup can fix.

    It's worse than that.  Look at the last line of BeautifulSoup output:

	&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything.  We're outside any tag at that point,
and the "/>" was introduced by BeautifulSoup.  That's both wrong and
puzzling; given that the output was generated from a parse tree, that
kind of error should never happen.  It looks as if the parser failed to
delete a string item after deciding it was actually part of a tag.
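
One way to check for this kind of damage mechanically is to scan the
serialized output for a "/>" that sits outside any tag.  The following is
only a rough sketch, not anything from BeautifulSoup itself: the function
name stray_tag_closers is made up for illustration, and it assumes the
browser-style reading discussed above, where ">" is allowed inside a
quoted attribute value.  It also assumes that a "<" in text always opens
a tag, which real-world HTML does not guarantee.

```python
def stray_tag_closers(markup):
    """Return offsets of '/>' sequences that occur outside any tag.

    Hypothetical helper for illustration only. Assumes '>' may appear
    inside a quoted attribute value and that '<' always opens a tag.
    """
    stray = []
    in_tag = False
    quote = None  # the open quote character while inside an attribute value
    i = 0
    while i < len(markup):
        ch = markup[i]
        if in_tag:
            if quote:
                # Inside a quoted value: only the matching quote matters.
                if ch == quote:
                    quote = None
            elif ch in "\"'":
                quote = ch
            elif ch == ">":
                in_tag = False
        else:
            if ch == "<":
                in_tag = True
            elif markup.startswith("/>", i):
                # A tag closer with no open tag to close.
                stray.append(i)
                i += 1  # skip the '>' as well
        i += 1
    return stray


# The bogus line from the prettify output is flagged once:
bad = '&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />'
print(stray_tag_closers(bad))

# A tag with ">>>" inside a quoted attribute value is not flagged:
print(stray_tag_closers('<param name="movie" value="a>>>b" />'))
```

Output generated from a consistent parse tree should give an empty list
here; a non-empty result points at text that the serializer emitted
outside of any tag.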

					John Nagle



