[XML-SIG] Expat Unicode parsing mystery

A.M. Kuchling akuchlin@mems-exchange.org
Thu, 23 Mar 2000 21:11:18 -0500


I wanted to check that the PyExpat module works with the Unicode
support that's currently in the Python CVS tree, and found something
that I don't understand.  Consider the test program appended below.
It creates a little ASCII XML document that claims to use a specified
encoding.  The ASCII string is then converted to a Unicode object, and
the Unicode object is then encoded into the specified encoding.  All
three strings are then parsed, and the results printed.  With the
encoding set to utf-16, this is the output:

[amk@mira extensions]$ ./python t.py
Parsing ASCII data
encoding specified in XML declaration is incorrect

	 # OK; Expat notices this is ASCII, and that the encoding is lying

Parsing Unicode string
('root', {})

	 # Huh?  Why does this work?  Python doesn't use UTF-16 internally!

Parsing UTF-16 encoded string
('root', {})

	 # OK; this should obviously work.

I have a feeling I'm missing something here.  Can anyone explain why
the second case doesn't fail?
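
To make the difference between the inputs concrete, here is a small
sketch of what the ASCII and UTF-16 forms look like as raw bytes.  It is
written for a present-day Python 3 interpreter rather than the CVS tree
discussed here, and the names decl, ascii_bytes, utf16_bytes and has_bom
are only illustrative.

```python
import codecs

encoding = 'utf-16'
decl = '<?xml version="1.0" encoding="%s"?><root/>' % encoding

# One byte per character, no byte-order mark.
ascii_bytes = decl.encode('ascii')

# Two bytes per character, preceded by a two-byte byte-order mark (BOM).
utf16_bytes = decl.encode('utf-16')

# The codec writes the BOM in the machine's native byte order.
has_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(ascii_bytes[:5])
print(has_bom, len(ascii_bytes), len(utf16_bytes))
```

The leading bytes are all a parser has to go on before it reads the
declaration, which is presumably how the ASCII input gets caught.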

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
And Herakles was full of it. He just got dead drunk for a couple of weeks in
Phrygia and told everyone he'd been to the land of the dead.
    -- Death, in SANDMAN: "The Song of Orpheus"

from xml.parsers import pyexpat

encoding = 'utf-16'

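# Plain ASCII document whose declaration claims the encoding above
asc_str = '<?xml version="1.0" encoding="%s"?><root/>' % encoding
# The same text as a Unicode object
u_str = unicode( asc_str )
# The same text as a byte string in the declared encoding
encoded = u_str.encode(encoding)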

def f(*args):
    print args

print 'Parsing ASCII data'
p = pyexpat.ParserCreate() ; p.StartElementHandler = f
res=p.Parse(asc_str, 1)
if not res: print pyexpat.ErrorString( p.ErrorCode )

print 'Parsing Unicode string'
p = pyexpat.ParserCreate() ; p.StartElementHandler = f
res=p.Parse(u_str, 1)
if not res: print pyexpat.ErrorString( p.ErrorCode )

print 'Parsing UTF-16 encoded string'
p = pyexpat.ParserCreate() ; p.StartElementHandler = f
res=p.Parse(encoded, 1)
if not res: print pyexpat.ErrorString( p.ErrorCode )
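
For comparison, the same three-way test can be sketched for a
present-day Python 3 interpreter.  This assumes the xml.parsers.expat
module, where Parse() raises ExpatError on failure instead of returning
0; the helper try_parse and the results dict are hypothetical names that
are not part of the program above.  It says nothing definite about what
the CVS tree does internally.

```python
from xml.parsers import expat

encoding = 'utf-16'
text = '<?xml version="1.0" encoding="%s"?><root/>' % encoding


def try_parse(data):
    """Parse data with a fresh parser.

    Returns the list of start-tag names seen, or the Expat error
    message as a string if parsing fails.
    """
    seen = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    try:
        parser.Parse(data, True)
    except expat.ExpatError as exc:
        return expat.ErrorString(exc.code)
    return seen


results = {
    # ASCII bytes whose declaration claims utf-16
    'ascii bytes': try_parse(text.encode('ascii')),
    # the text itself, as a str (the modern Unicode object)
    'str': try_parse(text),
    # bytes really encoded as utf-16, BOM included
    'utf-16 bytes': try_parse(text.encode('utf-16')),
}

for label, outcome in results.items():
    print(label, '->', outcome)
```

If the str case succeeds here as well, that is at least consistent with
the wrapper re-encoding Unicode input before Expat ever sees it, but
that is a guess about the CVS code, not something this sketch shows.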