Mysterious xml.sax Encoding Exception

Stefan Behnel stefan_ml at behnel.de
Sat Feb 2 11:16:53 EST 2008


Peck, Jon wrote:
> Yes, the characters were from the 0-127 ascii block but encoded as utf-16, so there is a null byte with each nonzero character.  I.e., \x00?\x00x\x00m\x00l\x00
> 
> Here is something weird I found while experimenting with ElementTree with this same XML string.
> 
> Consider the same XML both as a Python Unicode string and as a string containing utf-16 bytes, so that it is actually encoded as utf-16.  That is
> u'<?xml version="1.0" encoding="UTF-16" st' ...
> or
> '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...
> 
> So if these are x and y
> y = x.encode("utf-16")
> 
> The actual bytes would be the same, I think, although y is type str and x is type unicode.

No. The internal representation of unicode characters is platform dependent,
and is either 2 or 4 bytes per character. If you want UTF-16 bytes, call
.encode("utf-16") explicitly.
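
To illustrate the point, here is a minimal sketch written for present-day
Python 3 (an assumption; the thread itself predates it), where text is str
and encoded data is bytes. The variable names are made up for the example.
It only shows that the UTF-16 bytes come from an explicit .encode() call,
not from whatever the interpreter uses internally.

```python
import codecs

# A text (character) string; how it is stored in memory is an
# implementation detail that should never be relied on.
text = '<?xml version="1.0" encoding="UTF-16"?><doc/>'

# Explicit encoding produces bytes: a byte order mark followed by
# two bytes per character (all characters here are in the BMP).
data = text.encode("utf-16")

has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
expected_len = 2 * len(text) + 2  # two bytes per character plus the BOM

# Decoding with the same codec recovers the original characters.
round_trip = data.decode("utf-16")
```

The byte string is what belongs on the wire or on disk; the text string is
only the decoded view of it.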


> xml.sax.parseString documentation says
> 
> parses from a buffer string received as a parameter, 
> 
> so one might imagine that either x or y would be acceptable, and the bytes would be interpreted according to the encoding declaration in the byte stream.
> 
> And, in fact, both do work with xml.sax.parseString (at least for me).  With etree.parse(StringIO.StringIO...) though, only the str form works.

Don't try. Serialised XML is bytes, not characters.
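
A small sketch of what that means in practice, again assuming present-day
Python 3 and the standard library parsers (NameCollector is a hypothetical
handler written for this example). The document is encoded to bytes first,
and the bytes are what get handed to the parser, which then works out the
encoding from the BOM and the XML declaration.

```python
import xml.sax
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="UTF-16"?><doc><item>hi</item></doc>'
data = text.encode("utf-16")  # serialised XML: bytes, BOM included


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


# Both parsers are given the byte string, not the character string.
handler = NameCollector()
xml.sax.parseString(data, handler)

root = ET.fromstring(data)
item_text = root.find("item").text
```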

Stefan


