REALLY simple xml reader

Stefan Behnel stefan_ml at behnel.de
Sat Feb 2 01:24:36 EST 2008


Steven D'Aprano wrote:
> The same way it knows that "<?xml" is "<?xml" before it sees the 
> encoding. If the parser knows that the hex bytes 
> 
> 3c 3f 78 6d 6c
> 
> (or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free to 
> swap the byte order)
> 
> mean "<?xml"
> 
> then it can equally know that bytes 
> 
> 20 09 0a 
> 
> are whitespace. According to the XML standard, what else could they be?

So, what about all the other Unicode whitespace characters? And what about
different encodings and byte orders that move the bytes around? Is it OK for a
byte stream to start with "00 20", or does it have to start with "20 00"? What
about "00 20 00 00" and "00 00 00 20"? Are you sure that means 0x20 encoded in
4 bytes, or is it actually the Unicode character U+2000? How much complexity do
you want to put into the parser here?
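
To make the ambiguity concrete, here is a minimal Python sketch that decodes
the two four-byte sequences mentioned above under each UTF-16 and UTF-32 byte
order. The helper name decodings() is made up for this illustration; it only
uses the standard codecs and is not how any real XML parser detects encodings.

```python
def decodings(raw, codecs=("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")):
    """Decode the same bytes under several encodings.

    Returns a dict mapping codec name to a list of code points written
    as 'U+XXXX', or to None when the bytes are invalid in that codec.
    """
    results = {}
    for codec in codecs:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            # e.g. a UTF-32 value above U+10FFFF
            results[codec] = None
        else:
            results[codec] = [f"U+{ord(ch):04X}" for ch in text]
    return results


for sample in ("00 20 00 00", "00 00 00 20"):
    print(sample)
    for codec, result in decodings(bytes.fromhex(sample)).items():
        print(f"  {codec:10} {result}")
```

For "00 20 00 00" this gives U+0020 U+0000 as UTF-16BE, U+2000 U+0000 as
UTF-16LE, a single U+2000 as UTF-32LE, and a decoding error as UTF-32BE.
U+2000 (EN QUAD) is itself a Unicode whitespace character, so "these bytes are
whitespace" does not say which whitespace, or how many characters were read.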

"In the face of ambiguity, refuse the temptation to guess"

Stefan


