REALLY simple xml reader

Stefan Behnel stefan_ml at behnel.de
Sat Feb 2 09:48:28 EST 2008


Steven D'Aprano wrote:
> On Fri, 01 Feb 2008 07:51:56 +1100, Ben Finney wrote:
> 
>> Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
>>
>>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>>
>>>> Quite apart from a human thinking it's pretty or not pretty, it's
>>>> *not valid XML* if the XML declaration isn't immediately at the start
>>>> of the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many
>>>> XML parsers will (correctly) reject such a document.
>>> You know, I'd really like to know what the designers were thinking when
>>> they made this decision.
>> Probably much the same that the designers of the Unix shebang ("#!") or
>> countless other "figure out whether the bitstream is a specific type"
>> were thinking:
> 
> There's no real comparison with the shebang '#!'. It is important that 
> the shell can recognise a shebang with a single look-up for speed, and 
> the shell doesn't have to deal with the complexities of Unicode: if you 
> write your script in UTF-16, bash will complain that it can't execute the 
> binary file. The shell cares whether or not the first two bytes are 23 
> 21. An XML parser doesn't care about bytes, it cares about tags.

Or rather, about Unicode code points.

I actually think that you can compare the two. The shell can read the shebang,
recognise it, and continue reading up to the first newline to see what needs
to be done. That's one simple stream, no problem.
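
To make the "one simple stream" point concrete, here is a toy sketch in
Python. It only models the idea and is not how any real kernel or shell
implements it; the function name read_shebang and the use of io.BytesIO as
a stand-in for the script file are my own assumptions for illustration.

```python
import io


def read_shebang(stream):
    """Return the interpreter line if the stream starts with '#!', else None.

    Hypothetical helper: reads strictly forward, never seeks back.
    """
    # The marker only counts if it is the very first two bytes.
    if stream.read(2) != b"#!":
        return None
    # Keep reading up to the first newline to learn what needs to be done.
    return stream.readline().strip().decode("ascii", "replace")


script = io.BytesIO(b"#!/usr/bin/env python\nprint('hello')\n")
print(read_shebang(script))   # the interpreter line
print(script.read())          # the rest of the stream is still unread
```

A stream that has anything in front of the '#!', even a single newline,
simply yields None here.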

Same for the XML parser. It reads the stream and it will not have to look
back, even if the declaration requests a new encoding. Just like the shebang
has to be at the beginning, the declaration has to be there, too.
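
As a rough sketch of why no look-back is needed: the declaration, if
present, sits in the first bytes, so the encoding can be picked before
anything else is decoded. The helper name declared_encoding and the UTF-8
default are assumptions of mine, and the sketch only handles
ASCII-compatible input. A real parser also deals with byte order marks,
UTF-16 and the standalone attribute.

```python
import re

# Matches an XML declaration only at the very start of the data.
_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*(["\'])[^"\']*\1'
    rb'(?:\s+encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\2)?'
)


def declared_encoding(head, default="utf-8"):
    """Return the encoding named in a leading XML declaration.

    Hypothetical helper: falls back to `default` when there is no
    declaration at the start or it does not name an encoding.
    """
    match = _DECL.match(head)
    if match is None or match.group(3) is None:
        return default
    return match.group(3).decode("ascii")


data = b'<?xml version="1.0" encoding="iso-8859-1"?><name>Andr\xe9</name>'
encoding = declared_encoding(data)
# From here on the parser just keeps reading forward in that encoding.
print(encoding, data.decode(encoding))
```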

All I'm saying is that there is a point where you have to draw the line, and
the XML spec says that the XML declaration must be at the very beginning of
the document, and that it may be followed by whitespace. I think that's clear
and simple.
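
You can see the rule in action with the standard library's ElementTree
(which uses expat underneath). This is only a small demonstration; the
helper name is_accepted is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_accepted(document):
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


DECL = b'<?xml version="1.0" encoding="UTF-8"?>'

# Declaration at the very start: accepted.
print(is_accepted(DECL + b"<root/>"))
# Whitespace after the declaration: accepted.
print(is_accepted(DECL + b"\n  <root/>"))
# Whitespace before the declaration: rejected.
print(is_accepted(b"\n" + DECL + b"<root/>"))
```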

I admit that it's questionable whether omitting the declaration should be
allowed at all, but since there is only one case where you are allowed to do
that, I'm somewhat fine with this special case.

Stefan


