pyexpat and unicode
Alex Martelli
aleax at aleax.it
Mon Dec 17 14:10:22 EST 2001
mallum wrote:
...
> data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
> data = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
>
> data_uni.encode('utf8')
>
> parser.Parse(data)
> parser.Parse(data_uni)
...
> Does this mean I'm unable to pass utf8 encoded strings to pyexpat?
> According to the docs it should work. Can anyone shed some light on this?
I believe you can't pass some strings in one encoding and then, in the
same parse, others in a different encoding; or, as in this case, ones
that are not in fact encoded at all. Remember that the call to .encode
returns an encoded string, which you ignore: it doesn't change data_uni,
of course, since data_uni is immutable, like all strings.
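To make the point about .encode concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter rather than the Python 2 of this thread, and the names text and encoded are mine, purely for illustration:

```python
# Python 3 sketch: encode() returns a NEW bytes object and leaves the
# original string untouched, so the result must be bound to a name.
text = "<hello>\u0082</hello>"      # \u0082 is the same character as u"\202"

text.encode("utf-8")                # result discarded: no effect on text
assert text == "<hello>\u0082</hello>"

encoded = text.encode("utf-8")      # keep the result this time
print(type(text).__name__, type(encoded).__name__)
print(repr(encoded))
```

Calling .encode and dropping the result, as in the quoted snippet, therefore leaves you holding the same unencoded string you started with.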
Separate parses work fine:
import xml.parsers.expat

data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
data = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
denc = data_uni.encode('utf8')  # keep the encoded result this time

for thedata in data_uni, data, denc:
    # a fresh parser for each document
    parser = xml.parsers.expat.ParserCreate(encoding='utf8')
    print 'parsing', repr(thedata)
    parser.Parse(thedata, 1)
    print 'done'
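For anyone on a current interpreter, the same one-parser-per-document idea might look roughly like this. It is only a sketch assuming Python 3, where Parse accepts either str or bytes; the helper parse_text and the doc_* names are hypothetical, not part of pyexpat:

```python
import xml.parsers.expat

def parse_text(document):
    """Parse one complete document with a fresh parser; return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Character data may arrive in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)    # True marks this as the final chunk
    return "".join(chunks)

doc_text = "<?xml version='1.0' encoding='UTF-8' ?><hello>\u0082</hello>"
doc_plain = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
doc_bytes = doc_text.encode("utf-8")

# One parser per document, never one parser shared across documents.
results = [parse_text(d) for d in (doc_text, doc_plain, doc_bytes)]
for r in results:
    print(repr(r))
```

Each call builds its own parser, so a str document and its UTF-8 encoded bytes form never meet inside the same parse.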
Alex