REALLY simple xml reader

Sat Feb 2 07:39:19 EST 2008

On Fri, 01 Feb 2008 07:51:56 +1100, Ben Finney wrote:

> Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
> 
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>> 
>> > Quite apart from a human thinking it's pretty or not pretty, it's
>> > *not valid XML* if the XML declaration isn't immediately at the start
>> > of the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many
>> > XML parsers will (correctly) reject such a document.
>> 
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
> 
> Probably much the same that the designers of the Unix shebang ("#!") or
> countless other "figure out whether the bitstream is a specific type"
> were thinking:

There's no real comparison with the shebang '#!'. It is important that 
the shell can recognise a shebang with a single look-up for speed, and 
the shell doesn't have to deal with the complexities of Unicode: if you 
write your script in UTF-16, bash will complain that it can't execute the 
binary file. The shell cares whether or not the first two bytes are hex 
23 21. An XML parser doesn't care about bytes; it cares about tags.
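
To make the byte-level test concrete, here is a rough sketch in Python. 
The helper name looks_like_shebang is made up for illustration, and a 
real shell or kernel does more than this; the only point is that the 
check is a fixed two-byte comparison, which a UTF-16 script fails.

```python
SHEBANG = b"\x23\x21"  # the two bytes "#!"


def looks_like_shebang(data):
    """Hypothetical helper: compare the first two bytes, nothing more."""
    return data[:2] == SHEBANG


script = "#!/bin/sh\necho hello\n"

# Plain ASCII: the file really does start with 23 21.
print(looks_like_shebang(script.encode("ascii")))

# UTF-16: the encoder puts a byte order mark first, so the check fails.
print(looks_like_shebang(script.encode("utf-16")))
```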

It isn't good enough for an XML parser to grab the first five bytes of a 
file and say "That's legal XML!" in the same way that the shell can look 
at the first two bytes of a script and say "That's a shebang!". An XML 
parser must actually *parse*, even to determine whether or not it is 
looking at XML. Any such parser must be prepared to accept leading 
whitespace at the beginning of a file, and only reject it once it reaches 
an XML declaration tag, if any. When parsing a stream of bytes like this:

ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c

the parser doesn't know it is illegal until it has seen the fourteenth 
byte. That's the worst of both worlds: you have to provisionally accept 
whitespace just in case the XML declaration is missing, so you don't save 
any complexity, but if the declaration is there, you reject a perfectly 
fine document for an apparently arbitrary reason.
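
A minimal sketch of that scan, in Python. It is not an XML parser: the 
function name and constants are invented for this example, it assumes 
UTF-8 input, and it makes no attempt to tell an XML declaration apart 
from a processing instruction whose name merely begins with "xml". It 
only shows that the leading whitespace has to be consumed before the 
problem can be noticed.

```python
UTF8_BOM = b"\xef\xbb\xbf"
XML_WHITESPACE = b" \t\r\n"
DECL_START = b"<?xml"


def misplaced_declaration_end(data):
    """Return the 1-based position of the byte at which a misplaced
    XML declaration becomes evident, or None if there isn't one.

    Hypothetical helper, for illustration only.
    """
    pos = len(UTF8_BOM) if data.startswith(UTF8_BOM) else 0
    start = pos
    # Whitespace must be accepted provisionally: it is harmless if no
    # XML declaration follows it.
    while pos < len(data) and data[pos] in XML_WHITESPACE:
        pos += 1
    # Only whitespace *followed by* "<?xml" is the problem case.
    if pos > start and data.startswith(DECL_START, pos):
        return pos + len(DECL_START)
    return None


sample = bytes.fromhex("ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c")
print(misplaced_declaration_end(sample))  # 14
```

The same function returns None for input that starts directly with 
"<?xml", and for leading whitespace followed by anything else.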

> It's better to be as precise as possible so that failure can be
> unambiguous, than to have more-complex parsing rules that lead to
> ambiguity in implementation.

Precision and complexity are orthogonal attributes. "All valid documents 
must begin with the sequence of bytes representing the first 8093 digits 
of pi to the power of e in base 256" is very precise and completely 
unambiguous. There's one and only one byte sequence that satisfies such a 
requirement. But it is also very complex. On the other hand, "valid 
documents must begin with a number" is not complex at all, but very 
imprecise: what counts as a number? Is the word "one" a number?

A good example of how precision doesn't need to be the enemy of 
flexibility and simplicity: Python's rule dealing with imports from 
__future__ is precise. Any import from __future__ must be the first 
executable line in a module:

(1) There's no ambiguity. The first executable line is well-defined in 
the context of a Python program.

(2) The restriction is not arbitrary. There's a good technical reason for 
it, and the rule doesn't needlessly restrict what you can do.

(3) It is human-friendly: you can precede the import by a shebang line, 
the module doc string, comments, empty lines and other imports from 
__future__.
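
The rule is easy to check with the built-in compile(). This is only a 
sketch: the two source strings and the helper name compiles are made up 
for the example, and "division" is used simply because it is a feature 
name that current interpreters still accept.

```python
# A shebang line, a comment, a doc string and a blank line all come
# before the import, and the module still compiles.
ALLOWED = '''#!/usr/bin/env python
# a comment
"""Module doc string."""

from __future__ import division
x = 1 / 2
'''

# Here an executable statement comes first, so the import is too late.
TOO_LATE = '''x = 1
from __future__ import division
'''


def compiles(source):
    """Hypothetical helper: True if the source compiles as a module."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(compiles(ALLOWED))   # True
print(compiles(TOO_LATE))  # False
```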

-- 
Steven