pyexpat and unicode
Alex Martelli
aleax at aleax.it
Mon Dec 17 18:49:57 EST 2001
mallum wrote:
>
> Nope. This still breaks, with the same error;
>
> import xml.parsers.expat
> parser = xml.parsers.expat.ParserCreate(encoding='utf8')
>
> data_uni = u"<?xml version='1.0' encoding='UTF-8'?><hello>\202</hello>"
> data_uni.encode('utf8')
> parser.Parse(data_uni)
>
> Is this a Bug ?
Hmmm, yes, my own bug first of all:
...
>> for thedata in data_uni, data, denc:
>>     parser = xml.parsers.expat.ParserCreate(encoding='utf8')
>>     print 'parsing', repr(thedata)
>>     parser.Parse(data, 1)
>>     print 'done'
I was parsing data each time instead of thedata. Correcting this silly
error does show problems:
import sys
import xml.parsers.expat
parser = xml.parsers.expat.ParserCreate(encoding='utf8')
data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
data = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
denc = data_uni.encode('utf8')
for thedata in data_uni, data, denc:
    parser = xml.parsers.expat.ParserCreate(encoding='utf8')
    print 'parsing', repr(thedata)
    try: parser.Parse(thedata, 1)
    except:
        print 'oops', sys.exc_info()[0]
    print 'done'
[alex@arthur alex]$ python a.py
parsing u"<?xml version='1.0' encoding='UTF-8' ?><hello>\x82</hello>"
oops exceptions.UnicodeError
done
parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
done
parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>\xc2\x82</hello>"
oops xml.parsers.expat.ExpatError
done
[alex@arthur alex]$
The first one corresponds to what you're seeing: passing unicode data tries
to encode it with your default encoding, and the default's default is
ascii, which cannot represent u'\x82'. The second one is a string that fits
within the ascii subset of utf-8. And I don't know what to make of the
third one, which I thought would work.
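
To make the first two cases concrete, here is a minimal sketch. It assumes the
implicit conversion behaves like an explicit .encode('ascii'); try_ascii is
just an illustrative helper of mine, not anything pyexpat provides. It does
not say anything about the third case.

```python
# Sketch only: mimics the implicit unicode -> byte string conversion by
# calling .encode('ascii') explicitly (an assumption, see above).
data_uni = u"<hello>\202</hello>"   # \202 is an octal escape, i.e. U+0082
data = u"<hello>there</hello>"      # pure ascii content

def try_ascii(text):
    """Hypothetical helper: report whether text survives the ascii codec."""
    try:
        text.encode('ascii')
        return 'ok'
    except UnicodeError:
        return 'UnicodeError'

print(try_ascii(data_uni))   # UnicodeError
print(try_ascii(data))       # ok

# Encoding explicitly as utf-8 gives the two-byte form seen in the third case.
denc = data_uni.encode('utf-8')
print(repr(denc))
```

So the UnicodeError in the first case would be raised by the ascii codec
before expat ever sees the data, if that assumption holds.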
Alex