BeautifulSoup bug when ">>>" found in attribute value

Duncan Booth duncan.booth at invalid.invalid
Wed Dec 27 04:27:10 EST 2006


John Nagle <nagle at animats.com> wrote:

> And this came out, via prettify:
> 
><addresssnippet siteurl="http%3A//apartmentsapart.com" 
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>      <param name="movie"
>      value="/images/offersBanners/sw04.swf?binfot=We offer 
> fantastic rates for selected weeks or days!!&blinkt=Click here 
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
></param>
> 
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value.  It first parsed it right, but then stuck
> in an extra, totally bogus line.  Note the entity "&linkurl;", which
> appears nowhere in the original.  It looks like code to handle a
> missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML, so any output has to be a guess at what
was meant. A lot of HTML parsing code would assume that a quote was missing
and that the tag was terminated by the first '>'. IE and Firefox seem to
assume that '>' is allowed inside the attribute. BeautifulSoup seems to
have given you the best of both worlds: the attribute is parsed to the
closing quote, but the tag itself ends at the first '>'.
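
To make the difference between the two strategies concrete, here is a rough
sketch of each. It is not BeautifulSoup's or any browser's actual code; the
function names and the shortened sample tag are made up for illustration.

```python
def tag_end_first_gt(html, start=0):
    """Treat the first '>' as the end of the tag, ignoring quotes."""
    return html.find(">", start)


def tag_end_quote_aware(html, start=0):
    """Treat the first '>' outside a quoted attribute value as the end."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote found
        elif ch == ">":
            return i
    return -1  # no terminating '>' found


# Hypothetical, shortened version of the tag from the original post.
sample = '<param value="Click here >>>&linkurl=/Offer/2408">'

print(sample[:tag_end_first_gt(sample) + 1])
print(sample[:tag_end_quote_aware(sample) + 1])
```

The first call cuts the tag off inside the attribute value, and the second
runs on to the '>' after the closing quote.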

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed HTML that BeautifulSoup can fix.
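
That kind of cleanup can be approximated with a regular expression. This is
only a sketch of the idea, assuming an entity is '&' followed by word
characters; terminate_entities is a hypothetical helper, not something
BeautifulSoup provides, and it will also "fix" a bare ampersand in ordinary
text such as "fish&chips".

```python
import re

# '&' plus a run of word characters that is not followed by another
# word character or by the terminating ';'.
UNTERMINATED_ENTITY = re.compile(r"&(\w+)(?![\w;])")


def terminate_entities(text):
    """Append ';' to anything that looks like an unterminated entity."""
    return UNTERMINATED_ENTITY.sub(r"&\1;", text)


print(terminate_entities("Click here >>>&linkurl=/Offer/2408"))
# prints: Click here >>>&linkurl;=/Offer/2408
print(terminate_entities("already fine: &amp; stays as it is"))
# prints: already fine: &amp; stays as it is
```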



