pyexpat and unicode

Alex Martelli aleax at aleax.it
Mon Dec 17 18:49:57 EST 2001


mallum wrote:

> 
> Nope. This still breaks, with the same error;
> 
> import xml.parsers.expat
> parser = xml.parsers.expat.ParserCreate(encoding='utf8')
> 
> data_uni = u"<?xml version='1.0' encoding='UTF-8'?><hello>\202</hello>"
> data_uni.encode('utf8')
> parser.Parse(data_uni)
> 
> Is this a Bug ?

Hmmm, yes, my own bug first of all:
        ...
>> for thedata in data_uni, data, denc:
>>     parser = xml.parsers.expat.ParserCreate(encoding='utf8')
>>     print 'parsing', repr(thedata)
>>     parser.Parse(data, 1)
>>     print 'done'

I was parsing data each time instead of thedata.  Correcting this silly 
error does show problems:

import sys
import xml.parsers.expat

# \202 is an octal escape for U+0082, a non-ASCII character
data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
data     = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"

# the same document as a UTF-8 encoded byte string
denc = data_uni.encode('utf8')

for thedata in data_uni, data, denc:
    # a fresh parser for each document
    parser = xml.parsers.expat.ParserCreate(encoding='utf8')
    print 'parsing', repr(thedata)
    try:
        parser.Parse(thedata, 1)
    except:
        print 'oops', sys.exc_info()[0]
    print 'done'

[alex@arthur alex]$ python a.py
parsing u"<?xml version='1.0' encoding='UTF-8' ?><hello>\x82</hello>"
oops exceptions.UnicodeError
done
parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
done
parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>\xc2\x82</hello>"
oops xml.parsers.expat.ExpatError
done
[alex@arthur alex]$

The first one corresponds to what you're seeing (passing unicode data tries 
to encode it with your default encoding, and the default's default is 
ASCII), the second one is a string that fits within the ASCII subset of 
UTF-8... and I don't know what to make of the third one, which I thought 
would work.
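
As a rough illustration of the first case only, here is a minimal sketch of what the implicit conversion amounts to. It is written for Python 3 rather than the Python 2 used above, and the helper name try_ascii is made up for this example. It does not touch expat and says nothing about the third case.

```python
# Python 3 sketch (the script above is Python 2).
# "\x82" is U+0082, the same character as u"\202" in the original script.
text = "<hello>\x82</hello>"


def try_ascii(s):
    """Hypothetical helper: return the ASCII-encoded bytes of s,
    or the exception the ASCII codec raised."""
    try:
        return s.encode("ascii")
    except UnicodeError as exc:
        return exc


# U+0082 is outside ASCII, so this yields an exception object, not bytes.
ascii_result = try_ascii(text)

# An explicit UTF-8 encode succeeds and produces two bytes for U+0082.
utf8_bytes = text.encode("utf-8")

print(type(ascii_result).__name__)
print(repr(utf8_bytes))
```

The two bytes produced for U+0082 match the \xc2\x82 shown in the third repr of the transcript above.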


Alex



