Mysterious xml.sax Encoding Exception

Sat Feb 2 11:16:53 EST 2008

Peck, Jon wrote:
> Yes, the characters were from the 0-127 ascii block but encoded as utf-16, so there is a null byte with each nonzero character.  I.e., \x00?\x00x\x00m\x00l\x00
> 
> Here is something weird I found while experimenting with ElementTree with this same XML string.
> 
> Consider the same XML in two forms: as a Python unicode string (which I assume is actually stored as utf-16), and as a str containing utf-16 bytes.  That is
> u'<?xml version="1.0" encoding="UTF-16" st' ...
> or
> '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...
> 
> So if these are x and y
> y = x.encode("utf-16")
> 
> The actual bytes would be the same, I think, although y is type str and x is type unicode.

No. The internal representation of unicode characters depends on how Python
was built, and is either 2 or 4 bytes per character. If you want UTF-16 bytes,
call .encode("utf-16") explicitly.
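
To illustrate the point, here is a minimal sketch. The sample document and
the variable names are made up for illustration, and the code is written so
that it should also run on a current Python. It shows that the UTF-16 bytes
only come into existence through an explicit encode step, whatever the
interpreter uses internally:

```python
import codecs

# Hypothetical sample document, held as a text (unicode) string.
text = u'<?xml version="1.0" encoding="UTF-16"?><root/>'

# The explicit encode step produces the serialised byte form.
# The plain "utf-16" codec prepends a byte order mark (BOM) and
# uses the byte order of the machine it runs on.
data = text.encode("utf-16")

starts_with_bom = data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Every character in this sample is ASCII, so each one takes exactly
# 2 bytes in UTF-16, plus 2 bytes for the BOM.
expected_len = 2 * len(text) + 2

# Decoding the bytes gives back the original text.
round_trip = data.decode("utf-16")
```

The length of data depends only on the codec, not on whether the interpreter
stores 2 or 4 bytes per character internally.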

> xml.sax.parseString documentation says
> 
> parses from a buffer string received as a parameter, 
> 
> so one might imagine that either x or y would be acceptable, and the bytes would be interpreted according to the encoding declaration in the byte stream.
> 
> And, in fact, both do work with xml.sax.parseString (at least for me).  With etree.parse(StringIO.StringIO...) though, only the str form works.

Don't try. Serialised XML is bytes, not characters.
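
A minimal sketch of the bytes-only approach, assuming a made-up sample
document and a hypothetical TagCollector handler (neither is from the
original message). Both parsers receive the encoded bytes and work out the
encoding from the BOM and the XML declaration:

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical sample document; encode it once to get the serialised form.
text = u'<?xml version="1.0" encoding="UTF-16"?><root><item>hello</item></root>'
data = text.encode("utf-16")


class TagCollector(xml.sax.ContentHandler):
    """Illustrative handler that records element names in document order."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# SAX: pass the bytes, not the unicode string.
handler = TagCollector()
xml.sax.parseString(data, handler)

# ElementTree: wrap the same bytes in a binary file-like object.
tree = ET.parse(io.BytesIO(data))
root_tag = tree.getroot().tag
item_text = tree.getroot().find("item").text
```

Here io.BytesIO plays the role that StringIO.StringIO plays for a str in the
quoted message.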

Stefan