BeautifulSoup bug when ">>>" found in attribute value

Duncan Booth duncan.booth at invalid.invalid
Wed Dec 27 13:38:57 EST 2006


John Nagle <nagle at animats.com> wrote:

>     It's worse than that.  Look at the last line of BeautifulSoup
>     output: 
> 
>      &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
> 
> That "/>" doesn't match anything.  We're outside a tag at that point.
> And it was introduced by BeautifulSoup.  That's both wrong and
> puzzling; given that this was created from a parse tree, that type
> of error shouldn't ever happen.  This looks like the parser didn't
> delete a string item after deciding it was actually part of a tag.

The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag. 
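
To make the "tag ended at the first >" reading concrete, here is a minimal sketch. It is not BeautifulSoup's code; the regular expression is only an assumed stand-in for any scanner that ignores quoting, and the markup is a shortened copy of the param tag above.

```python
import re

# Shortened copy of the <param> tag from the original input.
markup = ('<param name="movie" value="/images/offersBanners/sw04.swf?'
          'blinkt=Click here >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />')

# Naive rule (an assumption for illustration): a tag runs from "<" to the
# first ">" that follows, regardless of whether we are inside quotes.
naive_tag = re.match(r'<[^>]*>', markup).group(0)

# Everything after that point would be seen as text outside the tag.
leftover = markup[len(naive_tag):]

print(naive_tag)
print(leftover)
```

Under that rule the leftover text still ends in `" />`, which would account for the stray "/>" in the output.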

... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped: 

From the HTML 4.01 spec:
> Similarly, authors should use "&gt;" (ASCII decimal 62) in text
> instead of ">" to avoid problems with older user agents that
> incorrectly perceive this as the end of a tag (tag close delimiter)
> when it appears in quoted attribute values. 

Thank you, it looks like I just learned something new.

Mind you, the sentence before that says 'should' for quoting < characters 
which is just plain silly.
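
For comparison, a quote-aware parser should keep the whole tag together. The sketch below assumes Python 3's standard-library html.parser rather than BeautifulSoup; TagCollector is a hypothetical helper written for this example, and the markup is a shortened copy of the param tag.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags (with attributes) and any text found outside tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<param ... />": the default handle_startendtag
        # calls handle_starttag followed by handle_endtag.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


markup = ('<param name="movie" value="/images/offersBanners/sw04.swf?'
          'blinkt=Click here >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />')

collector = TagCollector()
collector.feed(markup)
collector.close()

tag, attrs = collector.tags[0]
print(tag)
print(attrs["value"])
print(collector.text)
```

If the unescaped > characters are handled as the spec allows, there is a single param tag, its value attribute contains the ">>>", and no text is left over outside the tag.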


