elementtree and gbk encoding

Fredrik Lundh fredrik at pythonware.com
Wed Mar 15 00:45:02 EST 2006


Steven Bethard wrote:

> I'm having trouble using elementtree with an XML file that has some
> gbk-encoded text.  (I can't read Chinese, so I'm taking their word for
> it that it's gbk-encoded.)  I always have trouble with encodings, so I'm
> sure I'm just screwing something simple up.  Can anyone help me?

absolutely!

pyexpat has only limited support for non-standard encodings; the core
expat library only supports UTF-8, UTF-16, US-ASCII, and ISO-8859-1,
and the Python glue layer then adds support for all byte-to-byte
encodings supported by Python on top of that.  gbk is a multi-byte
encoding, so it isn't covered by either.

if you're using any other encoding, you need to recode the file on the
way in (just decoding to Unicode doesn't work, since the parser expects
an encoded byte stream).  the approach shown on this page should work:

    http://effbot.org/zone/celementtree-encoding.htm

except that it uses the new XMLParser interface which isn't available in
ET 1.2.6, and the corresponding XMLTreeBuilder interface in ET doesn't
support the encoding override argument...
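
for reference, here is a rough sketch of what the override approach can
look like where an XMLParser with an encoding argument is available.  it
assumes the xml.etree.ElementTree module shipped with current Python 3
(not ET 1.2.6), and parse_recoded and the sample document are made-up
names for illustration only:

```python
import xml.etree.ElementTree as ET


def parse_recoded(data, source_encoding):
    """Parse XML bytes in source_encoding (hypothetical helper)."""
    # recode the raw bytes to utf-8; the header still names the old encoding
    utf8_data = data.decode(source_encoding).encode("utf-8")
    # the encoding argument overrides whatever the <?xml encoding?> header says
    parser = ET.XMLParser(encoding="utf-8")
    parser.feed(utf8_data)
    return parser.close()


# made-up sample: a gbk-encoded document with a gbk header
sample = '<?xml version="1.0" encoding="gbk"?><doc>\u4e2d\u6587</doc>'.encode("gbk")
root = parse_recoded(sample, "gbk")
```

the caller has to know the source encoding up front here; the parser is
simply told to ignore the header.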

the easiest way to fix this is to modify the file header on the way in; if
the file has an <?xml encoding?> header, rip out the header and recode
from that encoding to utf-8 while parsing.
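
a minimal sketch of that idea, assuming Python 3, a file small enough to
read into memory, and an ASCII-compatible encoding such as gbk (so the
header itself can be matched as plain bytes).  parse_any_encoding and the
two regular expressions are hypothetical names, not part of ElementTree:

```python
import re
import xml.etree.ElementTree as ET

# matches an XML declaration at the start of the data
XML_DECL = re.compile(br'^\s*<\?xml\b[^>]*\?>')
# pulls the encoding name out of the declaration
ENCODING = re.compile(br'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def parse_any_encoding(data, default="utf-8"):
    """Strip the XML header, recode to utf-8, and parse (hypothetical helper)."""
    encoding = default
    m = XML_DECL.match(data)
    if m:
        e = ENCODING.search(m.group(0))
        if e:
            encoding = e.group(1).decode("ascii")
        data = data[m.end():]  # rip out the header
    # recode from the declared encoding to utf-8; without a header,
    # the parser assumes utf-8, which is now correct
    utf8_data = data.decode(encoding).encode("utf-8")
    return ET.fromstring(utf8_data)


# made-up sample: a gbk-encoded document with a gbk header
sample = '<?xml version="1.0" encoding="gbk"?><doc>\u4e2d\u6587</doc>'.encode("gbk")
root = parse_any_encoding(sample)
```

for a file on disk, something like parse_any_encoding(open(path, "rb").read())
should do, where path is whatever file you're loading.  this won't handle
UTF-16 input, but expat already deals with that on its own.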

</F>





