BeautifulSoup bug when ">>>" found in attribute value
Duncan Booth
duncan.booth at invalid.invalid
Wed Dec 27 04:27:10 EST 2006
John Nagle <nagle at animats.com> wrote:
> And this came out, via prettify:
>
><addresssnippet siteurl="http%3A//apartmentsapart.com"
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
> <param name="movie"
> value="/images/offersBanners/sw04.swf?binfot=We offer
> fantastic rates for selected weeks or days!!&blinkt=Click here
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
></param>
>
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value. It first parsed it right, but then stuck
> in an extra, totally bogus line. Note the entity "&linkurl;", which
> appears nowhere in the original. It looks like code to handle a
> missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML, so any output has to be a guess at what
was meant. A lot of HTML parsing code would assume that a quote was missing
and that the tag was terminated by the first '>'. IE and Firefox seem to
assume that the '>' is allowed inside the attribute value. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
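
To make the two guesses concrete, here is a minimal sketch of both ways of
finding the end of a tag. The function names and the shortened sample tag are
made up for illustration; this is not BeautifulSoup's code, only the two
strategies described above.

```python
def end_at_first_gt(html, start=0):
    """Naive strategy: the tag ends at the first '>' after `start`."""
    return html.index(">", start)


def end_at_first_unquoted_gt(html, start=0):
    """Quote-aware strategy: ignore any '>' inside a quoted attribute value."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote found
        elif ch == ">":
            return i
    raise ValueError("no closing '>' found")


# Shortened, hypothetical version of the tag from the original post.
tag = '<param name="movie" value="a.swf?blinkt=Click here >>>&linkurl=/Offer/2408">'

naive = end_at_first_gt(tag)
aware = end_at_first_unquoted_gt(tag)

print(tag[:naive + 1])  # tag cut short at the first '>'
print(tag[:aware + 1])  # the whole tag, '>>>' kept inside the value
```

The naive version stops in the middle of the attribute value, which is where
the leftover text in the prettified output comes from. The quote-aware version
behaves the way the browsers appear to.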

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed HTML that BeautifulSoup can fix.
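
A rough sketch of that kind of cleanup, assuming a simple regular expression
that treats any '&' followed by word characters as an entity name. The pattern
and the function name are my own; BeautifulSoup's actual handling may differ.

```python
import re

# '&' plus a run of word characters that is NOT followed by ';'.
# The lookahead also rejects partial matches inside a longer name.
UNTERMINATED_ENTITY = re.compile(r"&(\w+)(?![\w;])")


def terminate_entities(text):
    """Append the missing ';' to anything that looks like an entity."""
    return UNTERMINATED_ENTITY.sub(r"&\1;", text)


print(terminate_entities(">>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408"))
# prints: >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408
```

A fixer like this cannot tell a real entity from a query-string parameter such
as "&linkurl", which would explain why "&linkurl;" shows up in the output even
though it appears nowhere in the input.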