pyexpat and unicode

Alex Martelli aleax at aleax.it
Mon Dec 17 14:10:22 EST 2001


mallum wrote:
        ...
> data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
> data     = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
> 
> data_uni.encode('utf8')
> 
> parser.Parse(data)
> parser.Parse(data_uni)
        ...
> Does this mean Im unable to pass utf8 encoded strings to pyexpat ?
> According to the docs it should. Can anyone spread some light on this.

I believe you can't, within a single parse, pass some strings in one 
encoding followed by others in a different encoding, or, as in this case, 
by strings that are not in fact encoded at all. Remember that the call to 
.encode returns a new, encoded string, which you ignore; it doesn't change 
data_uni, of course, since it's immutable, like all strings.
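
A minimal sketch of that last point, in case it helps. The names text, 
original and encoded are mine, purely for illustration, and I'm assuming 
an interpreter that accepts both u"..." and b"..." literals:

```python
# Illustrative names only; \202 is an octal escape, i.e. the character U+0082.
text = u"<hello>\202</hello>"
original = text

# .encode builds and returns a NEW, encoded object...
encoded = text.encode('utf8')

# ...while text still names the very same, unmodified string.
same_object = text is original
```

Unless the return value is bound to a name, as encoded is here, the encoded 
form is simply thrown away.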

Separate parses work fine:

import xml.parsers.expat

data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
data     = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"

# keep the encoded result this time, instead of discarding it
denc = data_uni.encode('utf8')

for thedata in data_uni, data, denc:
    # a fresh parser for each document
    parser = xml.parsers.expat.ParserCreate(encoding='utf8')
    print 'parsing', repr(thedata)
    # the second argument (isfinal=1) marks the end of this document
    parser.Parse(thedata, 1)
    print 'done'
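
To see the difference between one parser per document and one parser for 
everything, here is a rough sketch. The helper parse_once and the other 
names are hypothetical, and I'm assuming that feeding a second complete 
document to the same parser is reported as an ExpatError:

```python
import xml.parsers.expat

doc = b"<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"

def parse_once(data):
    """Parse one complete document with a fresh parser; return its text."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append  # collect character data
    parser.Parse(data, True)                     # True: this is the last chunk
    return u"".join(chunks)

# Separate parses: each document gets its own parser.
first = parse_once(doc)
second = parse_once(doc)

# One parser, two documents: the second Parse call is treated as more of
# the SAME document, so the parser rejects the extra XML declaration.
shared = xml.parsers.expat.ParserCreate()
shared.Parse(doc)
try:
    shared.Parse(doc)
    reuse_failed = False
except xml.parsers.expat.ExpatError:
    reuse_failed = True
```

Here first and second both come back as the text of the hello element, 
while the shared parser trips on the second call.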


Alex



