REALLY simple xml reader

Sat Feb 2 05:44:39 EST 2008

On Sat, 02 Feb 2008 07:24:36 +0100, Stefan Behnel wrote:

> Steven D'Aprano wrote:
>> The same way it knows that "<?xml" is "<?xml" before it sees the
>> encoding. If the parser knows that the hex bytes
>> 
>> 3c 3f 78 6d 6c
>> 
>> (or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free
>> to swap the byte order)
>> 
>> mean "<?xml"
>> 
>> then it can equally know that bytes
>> 
>> 20 09 0a
>> 
>> are whitespace. According to the XML standard, what else could they be?
> 
> So, what about all the other unicode whitespace characters? 

What about them? They aren't part of the XML spec, which defines 
whitespace as the code points #x20, #x9, #xD and #xA. (Okay, I forgot 
carriage return. Oops.) You don't have to support arbitrary whitespace, 
only those four characters.
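
For what it's worth, here is a minimal sketch in Python of that check. The names XML_WHITESPACE and is_xml_whitespace are my own, for illustration only; they don't come from any parser.

```python
# The four whitespace code points listed above: space, tab, CR, LF.
XML_WHITESPACE = frozenset("\x20\x09\x0d\x0a")


def is_xml_whitespace(text):
    """Return True if text is non-empty and made only of XML whitespace."""
    return len(text) > 0 and all(ch in XML_WHITESPACE for ch in text)


print(is_xml_whitespace(" \t\r\n"))   # all four allowed characters
print(is_xml_whitespace("\u2000"))    # EN QUAD: Unicode whitespace, not XML whitespace
```

Other Unicode whitespace, such as U+2000, simply falls outside the set.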

> And what
> about different encodings and byte orders that move the bytes around? 

What about them? The Byte Order Mark is optional in the case of UTF-8, 
and compulsory in the case of UTF-16. I quote:

"Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin 
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], 
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH 
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not 
part of either the markup or the character data of the XML document. XML 
processors must be able to use this character to differentiate between 
UTF-8 and UTF-16 encoded documents."

So if your XML document is written in UTF-8, you don't need a BOM 
(although you can use one if you wish) and if it is in UTF-16 you *must* 
have one, even before the '<?xml'. If you don't, how will the parser 
recognise the characters '<?xml', not to mention the characters 
'encoding' and 'utf-16'?

> Is
> it ok for a byte stream to start with "00 20" or does it have to start
> with "20 00"? 

If you're using UTF-16, the byte stream MUST start with the BOM, so no, 
the above is illegal. If the BOM has already been seen, then it will tell 
the XML parser which order is legal, depending on whether the BOM was FF 
FE or FE FF.
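
A rough sketch of how a parser might use the BOM to settle the byte order, assuming only UTF-8 and UTF-16 need to be told apart. The function name sniff_bom is hypothetical, and UTF-32 is deliberately ignored (its little-endian BOM starts with the same two bytes as UTF-16's).

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    return None, 0


# Once the byte order is known, the rest of the stream decodes unambiguously.
raw = codecs.BOM_UTF16_BE + "<?xml".encode("utf-16-be")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))
```

With no BOM, the sketch returns (None, 0) and leaves the decision to the caller.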

If you're using UTF-8, the byte streams "00 20" and "20 00" would both be 
illegal: in UTF-8, the null byte is the Unicode code point #x0, which is 
illegal in XML.
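
To illustrate, a small sketch that decodes both byte streams as UTF-8 and tests each code point against the Char production from the XML 1.0 spec. The helper name is_xml_char is my own.

```python
def is_xml_char(code_point):
    """Return True if code_point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# Either order decodes, as UTF-8, to a NUL and a space; the NUL is not a legal Char.
for raw in (b"\x00\x20", b"\x20\x00"):
    text = raw.decode("utf-8")
    legal = all(is_xml_char(ord(ch)) for ch in text)
    print([hex(ord(ch)) for ch in text], legal)
```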

Support for any other encoding is entirely optional. A parser may choose 
to support other encodings, or not, and deal with them appropriately. But 
whatever encodings you support, the same issue comes up: if you can 
recognise '<?xml' before seeing the encoding, why can't you recognise 
whitespace?

> What about "00 20 00 00" and "00 00 00 20"? Are you sure
> that means 0x20 encoded in 4 bytes, or is it actually the unicode
> character 0x2000? What complexity do you want to put into the parser
> here?

I'm not putting any complexity into the parser that the XML standard 
doesn't already demand. Perhaps you should read it yourself:

http://www.w3.org/TR/xml/

In particular, note that a parser must be prepared to accept leading 
whitespace at the start of a document, and only reject it if it then 
comes across an XML declaration.
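
You can see this behaviour with the expat-backed parser in Python's standard library. This is only a quick sketch; the helper name parses is mine, and other parsers may report the error differently.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if ElementTree accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Leading whitespace with no XML declaration.
print(parses(b"  \n<root/>"))
# XML declaration at the very start.
print(parses(b"<?xml version='1.0'?><root/>"))
# Leading whitespace followed by an XML declaration.
print(parses(b" <?xml version='1.0'?><root/>"))
```

Only the third document should be rejected: the parser has to read past the whitespace before it can object to the declaration that follows.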

> "In the face of ambiguity, refuse the temptation to guess"

What ambiguity, and what guess?

My earlier question wasn't rhetorical. I asked "According to the XML 
standard, what else could they [whitespace] be?". Just implying that they 
are ambiguous doesn't actually make them ambiguous. 

I don't believe there is an ambiguity at all. That's what makes the 
prohibition on leading whitespace before the '<?xml' declaration all the 
more puzzling: there doesn't seem to be any good reason for it.

If I am wrong, then will somebody please put me out of my misery and tell 
me what leading whitespace could be mistaken for, in what circumstances?

-- 
Steven