[XML-SIG] Normalized AttVals

Andrew M. Kuchling akuchlin@cnri.reston.va.us
Mon, 14 Dec 1998 16:51:01 -0500 (EST)


John Day writes:
>Re: quoted attribute contents ("AttVal")
>When '>' is encountered e.g. <code op=">"> it is "normalized"
>to '&gt;', however, when '&' is encountered it is a fatal
>error e.g. <a href="www.zzz.com?a=1&b=3">
>
>Is this pyexpat behavior correct? Why can't the parser tell that
>'&b' above is _not_ a defined entity because it is not terminated
>by ';'? It seems to me that this usage could be normalized to
>'&amp;b', just like pyexpat did for '>'. Then it would be backward
>compatible with HTML (sort of).

	Actually, the fact that the above HTML href works is an
artifact of the error recovery in HTML parsers; you really are
supposed to write <a href="www.zzz.com?a=1&amp;b=3">.  In XML a bare
'&' in an attribute value is a well-formedness error, so pyexpat is
right to reject it.  There were some lengthy threads about this in
comp.infosystems.www.authoring.html a few months ago, when someone
found that in "a=1&section=4", their browser was picking up &sect and
turning it into the section sign character, which made the link not
behave as expected.
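
	As a rough illustration, here is a minimal sketch using the
xml.parsers.expat module from a current Python standard library (an
assumption on my part; it is not the exact pyexpat build discussed
above).  The helper name parse_href is made up for this example.  It
shows the parser rejecting the bare '&', accepting '&amp;', and
xml.sax.saxutils.quoteattr doing the escaping for you.  The last line
uses html.unescape only to imitate the kind of lenient entity decoding
described above; it is not a model of any particular browser.

```python
import html
from xml.parsers import expat
from xml.sax.saxutils import quoteattr


def parse_href(document):
    """Parse a one-element XML document and return the href of <a>.

    Hypothetical helper for illustration; raises expat.ExpatError if
    the document is not well-formed.
    """
    found = {}

    def start_element(name, attrs):
        # attrs is a dict of attribute name -> normalized value
        if name == "a":
            found["href"] = attrs.get("href")

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(document, True)  # True marks this as the final chunk
    return found.get("href")


bad = '<a href="www.zzz.com?a=1&b=3"/>'
good = '<a href="www.zzz.com?a=1&amp;b=3"/>'

# The escaped form parses, and the handler sees the plain '&'.
href = parse_href(good)

# The unescaped form is rejected as not well-formed.
try:
    parse_href(bad)
    rejected = False
except expat.ExpatError:
    rejected = True

# quoteattr escapes '&' (and '<', '>') and adds the surrounding quotes.
quoted = quoteattr("www.zzz.com?a=1&b=3")

# Lenient HTML-style decoding treats '&sect' as an entity even
# without the terminating ';', which mangles the query string.
mangled = html.unescape("a=1&section=4")
```

	Generating attribute values through something like quoteattr,
rather than pasting URLs in by hand, avoids the problem at the source.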

	I think the XML community wishes to avoid depending on error
recovery in this way, because it leads to the same pit that HTML fell
into.  HTML parsers were so forgiving of invalid HTML that few
authors bothered to check whether their HTML was valid.  As a result,
you could never switch to a stricter parser, because so little of
the HTML in existence would be accepted by it.

-- 
A.M. Kuchling			http://starship.skyport.net/crew/amk/
And Herakles was full of it. He just got dead drunk for a couple of weeks in
Phrygia and told everyone he'd been to the land of the dead.
    -- Death, in SANDMAN: "The Song of Orpheus"