REALLY simple xml reader

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Feb 1 20:44:50 EST 2008


On Thu, 31 Jan 2008 18:35:17 +0100, Stefan Behnel wrote:

> Hi,
> 
> Steven D'Aprano wrote:
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>> 
>>> Quite apart from a human thinking it's pretty or not pretty, it's *not
>>> valid XML* if the XML declaration isn't immediately at the start of
>>> the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
>>> parsers will (correctly) reject such a document.
>> 
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
> [had a good laugh here]
>> This is legal XML:
>> 
>> """<?xml version="1.0"?>
>> <greeting>Hello, world!</greeting>"""
>> 
>> and so is this:
>> 
>> """
>>      <greeting       >Hello, world!</greeting    >"""
>> 
>> 
>> but not this:
>> 
>> """ <?xml version="1.0"?>
>> <greeting>Hello, world!</greeting>"""
> 
> It's actually not that stupid. When you leave out the declaration, then
> the XML is UTF-8 encoded (by spec), so normal ASCII whitespace doesn't
> matter. It's just like the declaration had come *before* the whitespace,
> at the very beginning of the byte stream.
> 
> But if you add a declaration, then the encoding can change for the whole
> document (including the declaration!), so you have to give the parser a
> chance to actually parse the declaration. How is it supposed to know
> that the whitespace before the declaration *is* whitespace before it
> knows the encoding?

The same way it knows that "<?xml" is "<?xml" before it sees the 
encoding. If the parser knows that the hex bytes 

3c 3f 78 6d 6c

(or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free to 
swap the byte order)

mean "<?xml"

then it can equally know that bytes 

20 09 0a 0d

are whitespace. According to the XML standard, what else could they be?
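
To make that concrete, here is a minimal sketch of a sniffer that skips
leading whitespace bytes and then looks for the "<?xml" byte pattern. The
function name sniff() and the family labels are made up for illustration;
they are not part of any XML library. It assumes only three encoding
families (ASCII-compatible, UTF-16 little-endian, UTF-16 big-endian) and
ignores byte order marks and everything else.

```python
# Hypothetical helper for illustration only; no real parser exposes this.
XML_WS = (0x20, 0x09, 0x0A, 0x0D)  # space, tab, LF, CR


def sniff(data):
    """Return (encoding_family, offset_of_declaration), or (None, None)."""
    candidates = [
        # (label, bytes per code unit, how one ASCII code point is laid out)
        ("ascii-compatible", 1, lambda u: bytes([u])),
        ("utf-16-le", 2, lambda u: bytes([u, 0])),
        ("utf-16-be", 2, lambda u: bytes([0, u])),
    ]
    for name, width, enc in candidates:
        whitespace = {enc(u) for u in XML_WS}
        magic = b"".join(enc(u) for u in b"<?xml")
        pos = 0
        # Skip whitespace as it would look in this encoding family.
        while data[pos:pos + width] in whitespace:
            pos += width
        # Then check for the declaration's byte pattern at that offset.
        if data.startswith(magic, pos):
            return name, pos
    return None, None


if __name__ == "__main__":
    print(sniff(b' \t\n<?xml version="1.0"?><greeting/>'))
    print(sniff(' <?xml version="1.0"?>'.encode("utf-16-le")))
    print(sniff(b"<greeting/>"))
```

This only shows that the whitespace bytes are as recognisable as the
"<?xml" bytes under the same assumptions. It says nothing about what a
conforming parser is allowed to accept.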



-- 
Steven


