REALLY simple xml reader
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Feb 2 07:39:19 EST 2008
On Fri, 01 Feb 2008 07:51:56 +1100, Ben Finney wrote:
> Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
>
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>
>> > Quite apart from a human thinking it's pretty or not pretty, it's
>> > *not valid XML* if the XML declaration isn't immediately at the start
>> > of the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many
>> > XML parsers will (correctly) reject such a document.
>>
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
>
> Probably much the same that the designers of the Unix shebang ("#!") or
> countless other "figure out whether the bitstream is a specific type"
> were thinking:
There's no real comparison with the shebang '#!'. It is important that
the shell can recognise a shebang with a single look-up for speed, and
the shell doesn't have to deal with the complexities of Unicode: if you
write your script in UTF-16, bash will complain that it can't execute the
binary file. The shell cares whether or not the first two bytes are 0x23
0x21 ("#!"). An XML parser doesn't care about bytes; it cares about tags.
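To show how cheap the shebang test is, here is a minimal Python sketch. The name has_shebang is mine, purely for illustration; it only mimics the two-byte comparison and makes no claim about how any real shell or kernel implements it.

```python
SHEBANG = bytes.fromhex("2321")  # the two bytes 0x23 0x21, i.e. b"#!"


def has_shebang(data: bytes) -> bool:
    """Hypothetical helper: one fixed-offset comparison, no parsing at all."""
    return data[:2] == SHEBANG


ascii_script = b"#!/bin/sh\necho hello\n"
# Encoding as UTF-16 puts a byte order mark in front, so the first two
# bytes are no longer 0x23 0x21.
utf16_script = "#!/bin/sh\necho hello\n".encode("utf-16")

print(has_shebang(ascii_script))  # True
print(has_shebang(utf16_script))  # False
```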
It isn't good enough for an XML parser to grab the first five bytes of a
file and say "That's legal XML!" in the same way that the shell can look
at the first two bytes of a script and say "That's a shebang!". An XML
parser must actually *parse*, even to determine whether or not it is
looking at XML. Any such parser must be prepared to accept leading
whitespace at the beginning of a file, and only reject it once it reaches
an XML declaration tag, if any. When parsing a stream of bytes like this:
ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c
the parser doesn't know it is illegal until it has seen the fourteenth
byte. That's the worst of both worlds: you have to provisionally accept
whitespace just in case the XML declaration is missing, so you don't save
any complexity, but if the declaration is there, you reject a perfectly
fine document for an apparently arbitrary reason.
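The byte counting above can be sketched in Python. This is an illustration under stated assumptions, not a real parser: the helper names are hypothetical, only a UTF-8 byte order mark is handled, and any "<?xml" prefix is treated as a declaration (a real parser would look one byte further to rule out a processing instruction such as "<?xml-stylesheet"). The second half uses the standard library's xml.etree.ElementTree, which is built on expat, to show the accept/reject behaviour.

```python
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"
XML_WHITESPACE = b" \t\r\n"
DECL_START = b"<?xml"


def bytes_read_before_rejection(data: bytes):
    """Hypothetical helper: how many bytes must be read before a misplaced
    declaration can be detected. Returns None if nothing is misplaced."""
    pos = len(UTF8_BOM) if data.startswith(UTF8_BOM) else 0
    first = pos
    # Leading whitespace has to be accepted provisionally, because it is
    # legal whenever the declaration is absent.
    while pos < len(data) and data[pos] in XML_WHITESPACE:
        pos += 1
    if pos > first and data.startswith(DECL_START, pos):
        return pos + len(DECL_START)
    return None


def parses(document: bytes) -> bool:
    """Hypothetical helper: True if ElementTree accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


sample = bytes.fromhex("ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c")
print(bytes_read_before_rejection(sample))  # 14

print(parses(b"   \n\t<root/>"))                      # True
print(parses(b"<?xml version='1.0'?><root/>"))        # True
print(parses(b"   \n\t<?xml version='1.0'?><root/>"))  # False
```

The same leading whitespace is harmless in the first document and fatal in the third, which is the asymmetry I am complaining about.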
> It's better to be as precise as possible so that failure can be
> unambiguous, than to have more-complex parsing rules that lead to
> ambiguity in implementation.
Precision and complexity are orthogonal attributes. "All valid documents
must begin with the sequence of bytes representing the first 8093 digits
of pi to the power of e in base 256" is very precise and completely
unambiguous. There's one and only one byte sequence that satisfies such a
requirement. But it is also very complex. On the other hand, "valid
documents must begin with a number" is not complex at all, but very
imprecise: what counts as a number? Is the word "one" a number?
A good example of how precision doesn't need to be the enemy of
flexibility and simplicity: Python's rule dealing with imports from
__future__ is precise. Any import from __future__ must be the first
executable line in a module:
(1) There's no ambiguity. The first executable line is well-defined in
the context of a Python program.
(2) The restriction is not arbitrary. There's a good technical reason for
it, and the rule doesn't needlessly restrict what you can do.
(3) It is human-friendly: you can precede the import by a shebang line, the
module doc string, comments, empty lines and other imports from __future__.
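The rule is easy to check with the built-in compile(). A minimal sketch follows; the helper name compiles and the sample sources are my own, chosen for illustration.

```python
def compiles(source: str) -> bool:
    """Hypothetical helper: True if the source compiles as a module."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# A shebang line, a comment, an empty line and the module doc string
# may all come before the import.
ALLOWED = '''#!/usr/bin/env python
# a comment

"""Module doc string."""

from __future__ import division
'''

# An ordinary executable statement before the import is rejected.
REJECTED = '''x = 1
from __future__ import division
'''

print(compiles(ALLOWED))   # True
print(compiles(REJECTED))  # False
```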
--
Steven