[XML-SIG] Expat Unicode parsing mystery
A.M. Kuchling
akuchlin@mems-exchange.org
Thu, 23 Mar 2000 21:11:18 -0500
I wanted to check that the PyExpat module works with the Unicode
support that's currently in the Python CVS tree, and found something
that I don't understand. Consider the test program appended below.
It creates a little ASCII XML file that claims to use a specified
encoding. The ASCII string is then converted to a Unicode object, and
that object is encoded into the specified encoding. All three strings
are then parsed and the results printed. With the encoding set to
utf-16, this is the output:
[amk@mira extensions]$ ./python t.py
Parsing ASCII data
encoding specified in XML declaration is incorrect
# OK; Expat notices this is ASCII, and that the encoding is lying
Parsing Unicode string
('root', {})
# Huh? Why does this work? Python doesn't use UTF-16 internally!
Parsing UTF-16 encoded string
('root', {})
# OK; this should obviously work.
I have a feeling I'm missing something here. Can anyone explain why
the second case doesn't fail?
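
To make the three inputs concrete, here is a minimal sketch of what they look like at the byte level. It is written for Python 3, which is an assumption on my part: there the "ASCII string" has to be built explicitly with encode('ascii'), and the names text, ascii_bytes and utf16_bytes are illustrative, not taken from the test program.

```python
import codecs

encoding = 'utf-16'
text = '<?xml version="1.0" encoding="%s"?><root/>' % encoding

# Case 1: one byte per character, no byte order mark.
ascii_bytes = text.encode('ascii')

# Case 3: a two-byte byte order mark followed by two bytes per character.
utf16_bytes = text.encode(encoding)

# The 'utf-16' codec writes the BOM in the machine's native byte order.
has_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(len(text), len(ascii_bytes), len(utf16_bytes), has_bom)
```

Case 2 is the text object itself; what bytes the parser actually receives for it depends on how the extension module converts a Unicode object, which is exactly the open question here.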
--
A.M. Kuchling http://starship.python.net/crew/amk/
And Herakles was full of it. He just got dead drunk for a couple of weeks in
Phrygia and told everyone he'd been to the land of the dead.
-- Death, in SANDMAN: "The Song of Orpheus"
from xml.parsers import pyexpat
encoding = 'utf-16'
asc_str = '<?xml version="1.0" encoding="%s"?><root/>' % encoding
u_str = unicode( asc_str )
encoded = u_str.encode(encoding)
def f(*args):
    print args
print 'Parsing ASCII data'
p = pyexpat.ParserCreate()
p.StartElementHandler = f
res = p.Parse(asc_str, 1)
if not res:
    print pyexpat.ErrorString(p.ErrorCode)

print 'Parsing Unicode string'
p = pyexpat.ParserCreate()
p.StartElementHandler = f
res = p.Parse(u_str, 1)
if not res:
    print pyexpat.ErrorString(p.ErrorCode)

print 'Parsing UTF-16 encoded string'
p = pyexpat.ParserCreate()
p.StartElementHandler = f
res = p.Parse(encoded, 1)
if not res:
    print pyexpat.ErrorString(p.ErrorCode)
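
For anyone repeating the experiment on a current interpreter, the following is a rough Python 3 port of the same three cases, offered as a sketch rather than a faithful reproduction. It assumes the module is imported as xml.parsers.expat, that Parse() raises ExpatError on failure instead of returning 0, and it uses a hypothetical helper, parse_roots, that is not part of the original test. The results say nothing definite about the CVS tree discussed above.

```python
from xml.parsers import expat


def parse_roots(data):
    """Parse data; return the start tags seen, or Expat's error text."""
    seen = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, attrs))
    try:
        parser.Parse(data, True)
    except expat.ExpatError as err:
        # err.code is the numeric Expat error code.
        return expat.ErrorString(err.code)
    return seen


encoding = 'utf-16'
text = '<?xml version="1.0" encoding="%s"?><root/>' % encoding

print('Parsing ASCII data')
print(parse_roots(text.encode('ascii')))

print('Parsing Unicode string')
print(parse_roots(text))

print('Parsing UTF-16 encoded string')
print(parse_roots(text.encode(encoding)))
```

In current CPython, a str passed to Parse() is handed to Expat as UTF-8 with the encoding set explicitly, so the declaration in the document is not what decides the outcome of the second case there.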