Mysterious xml.sax Encoding Exception

Peck, Jon peck at spss.com
Sat Feb 2 10:50:25 EST 2008


Yes, the characters were from the 0-127 ascii block but encoded as utf-16, so there is a null byte with each nonzero character.  I.e., \x00?\x00x\x00m\x00l\x00
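
For illustration, a small sketch of that byte layout. It assumes Python 3 (where the unicode/str pair in this thread becomes str/bytes) and uses the explicit "utf-16-le" codec so the byte order is fixed and no byte order mark is added; the sample text is just the first few characters of an XML declaration.

```python
# Assumption: Python 3, so text is str and the encoded form is bytes.
text = "<?xml"

# "utf-16-le" pins the byte order and emits no byte order mark.
encoded = text.encode("utf-16-le")

# Every second byte is the high byte of a 16-bit code unit; for
# characters in the 0-127 range that byte is always zero.
high_bytes = encoded[1::2]

print(repr(encoded))
print(repr(high_bytes))
```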

Here is something weird I found while experimenting with ElementTree with this same XML string.

Consider the same XML in two forms: as a Python Unicode string (so it is actually held as utf-16), and as a plain string containing the utf-16 bytes.  That is
u'<?xml version="1.0" encoding="UTF-16" st' ...
or
'\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...

So if these are x and y respectively, then
y = x.encode("utf-16")

The actual bytes would be the same, I think, although y is type str and x is type unicode.
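
One way to look at the encoded form is sketched here. It assumes Python 3 (str for x, bytes for y), and the XML document is a made-up stand-in, not the one from this thread. The plain "utf-16" codec writes a byte order mark first and uses the platform's byte order, so the check accepts either mark.

```python
import codecs

# Hypothetical stand-in for the XML discussed in the thread.
x = '<?xml version="1.0" encoding="UTF-16"?><root/>'

# The plain "utf-16" codec prepends a byte order mark (BOM).
y = x.encode("utf-16")

# The BOM is \xff\xfe on little-endian platforms, \xfe\xff on big-endian.
has_bom = y.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Decoding with the same codec consumes the BOM and restores the text.
round_trip = y.decode("utf-16")

print(has_bom, len(x), len(y))
```

For text in the 0-127 range, y is two bytes per character plus the two-byte mark.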

The xml.sax.parseString documentation says that it

parses from a buffer string received as a parameter, 

so one might imagine that either x or y would be acceptable, and the bytes would be interpreted according to the encoding declaration in the byte stream.

And, in fact, both do work with xml.sax.parseString (at least for me).  With etree.parse(StringIO.StringIO...) though, only the str form works.
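
A minimal sketch of the byte-string case with both parsers follows. It assumes Python 3, where io.BytesIO stands in for StringIO.StringIO and bytes for the old str type; the document and the TextCollector handler are hypothetical names made up for this example. It only exercises the encoded form, since how the text form behaves depends on the Python version.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical document; only characters from the 0-127 range.
x = '<?xml version="1.0" encoding="UTF-16"?><root><item>abc</item></root>'
y = x.encode("utf-16")  # bytes with a leading byte order mark


class TextCollector(xml.sax.ContentHandler):
    """Illustrative handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


# SAX: parse the encoded bytes; the parser picks up the encoding
# from the byte order mark and the XML declaration.
handler = TextCollector()
xml.sax.parseString(y, handler)
sax_text = "".join(handler.chunks)

# ElementTree: parse the same bytes through a file-like object.
tree = ET.parse(io.BytesIO(y))
etree_text = tree.getroot().findtext("item")

print(sax_text, etree_text)
```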

Regards,
Jon Peck


-----Original Message-----
From: Jeroen Ruigrok van der Werven [mailto:asmodai at in-nomine.org] 
Sent: Saturday, February 02, 2008 12:57 AM
To: JKPeck
Cc: python-list at python.org
Subject: Re: Mysterious xml.sax Encoding Exception

-On [20080201 19:06], JKPeck (JKPeck at gmail.com) wrote:
>In both of these cases, there are only plain, 7-bit ascii characters
>in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block? 

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
We have met the enemy and they are ours...

