elementtree and gbk encoding

Wed Mar 15 13:41:17 EST 2006

Fredrik Lundh wrote:
> Steven Bethard wrote:
> 
>> I'm having trouble using elementtree with an XML file that has some
>> gbk-encoded text.  (I can't read Chinese, so I'm taking their word for
>> it that it's gbk-encoded.)  I always have trouble with encodings, so I'm
>> sure I'm just screwing something simple up.  Can anyone help me?
> 
> absolutely!
> 
> pyexpat has only limited support for non-standard encodings; the core
> expat library only supports UTF-8, UTF-16, US-ASCII, and ISO-8859-1,
> and the Python glue layer then adds support for all byte-to-byte
> encodings supported by Python on top of that.
> 
> if you're using any other encoding, you need to recode the file on the
> way in (just decoding to Unicode doesn't work, since the parser expects
> an encoded byte stream).  the approach shown on this page should work
> 
>     http://effbot.org/zone/celementtree-encoding.htm
> 
> except that it uses the new XMLParser interface which isn't available in
> ET 1.2.6, and the corresponding XMLTreeBuilder interface in ET doesn't
> support the encoding override argument...
> 
> the easiest way to fix this is to modify the file header on the way in; if
> the file has an <?xml encoding?> header, rip out the header and recode
> from that encoding to utf-8 while parsing.
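
The header-stripping approach described above could look roughly like this. It is only a sketch, assuming a current Python 3 with the standard xml.etree.ElementTree module rather than ET 1.2.6. The names parse_recoded and fallback_encoding are made up for illustration and are not part of any library.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration, e.g. <?xml version="1.0" encoding="gbk"?>
_XML_DECL = re.compile(rb"^\s*<\?xml[^>]*\?>")
_ENCODING = re.compile(rb"encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")


def parse_recoded(data, fallback_encoding):
    """Recode XML bytes to UTF-8 before handing them to the parser.

    Hypothetical helper: `data` is the raw file contents as bytes, and
    `fallback_encoding` is used when the file carries no declaration.
    """
    encoding = fallback_encoding
    decl = _XML_DECL.match(data)
    if decl:
        found = _ENCODING.search(decl.group(0))
        if found:
            encoding = found.group(1).decode("ascii")
        # Rip out the header so the parser does not see the old encoding.
        data = data[decl.end():]
    text = data.decode(encoding)
    # Without a declaration the parser assumes UTF-8, which is now correct.
    return ET.fromstring(text.encode("utf-8"))
```

Usage would be something like parse_recoded(open(filename, "rb").read(), "gbk"). This reads the whole file into memory, which is simpler than feeding the parser in chunks but may not suit very large files.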

Hmm...  I downloaded the newest cElementTree (and I already had the 
newest ElementTree), and here's what I get:

 >>> def myparser(file, encoding):
...     f = codecs.open(file, "r", encoding)
...     p = ET.XMLParser(encoding="utf-8")
...     while 1:
...         s = f.read(65536)
...         if not s:
...             break
...         p.feed(s.encode("utf-8"))
...     return ET.ElementTree(p.close())
...
 >>> tree = myparser(filename, 'gbk')
Traceback (most recent call last):
   File "<interactive input>", line 1, in ?
   File "<interactive input>", line 8, in myparser
SyntaxError: not well-formed (invalid token): line 8, column 6

FWIW, the file used above doesn't have an <?xml encoding?> header:

 >>> open(filename).read()
'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n 
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN 
(LCP-TMP (IP (NP-PN-SBJ (NR \xb7\xfc\xc3\xf7\xcf\xbc)) \n\t\t       (VP 
(VV \xbb\xf1\xb5\xc3) \n\t\t\t   (NP-OBJ (NN \xc5\xae\xd7\xd3) 
\n\t\t\t\t   (NN \xcc\xf8\xcc\xa8) \n\t\t\t\t   (NN \xcc\xf8\xcb\xae) 
\n\t\t\t\t   (NN \xb9\xda\xbe\xfc)))) \n\t\t   (LC \xba\xf3)) \n 
   (PU \xa3\xac) \n          (NP-SBJ (NP-PN (NR 
\xcb\xd5\xc1\xaa\xb6\xd3)) \n                  (NP (NN 
\xbd\xcc\xc1\xb7))) \n          (VP (ADVP (AD \xc8\xc8\xc7\xe9)) \n 
          (PP-DIR (P \xcf\xf2) \n\t\t      (NP (PN \xcb\xfd))) \n 
        (VP (VV \xd7\xa3\xba\xd8))) \n          (PU \xa1\xa3)) ) 
\n</S>\n<S ID=2567>\n( (FRAG  (NR \xd0\xc2\xbb\xaa\xc9\xe7) \n 
(NN \xbc\xc7\xd5\xdf) \n         (NR \xb3\xcc\xd6\xc1\xc9\xc6) \n 
   (VV \xc9\xe3) )) \n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
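
One thing that may be worth checking: line 8 of that data is <S ID=2566>, and if the reported column is zero-based (as I believe expat's columns are), column 6 lands on the unquoted attribute value. So the error might be about well-formedness rather than the encoding. Below is a rough sketch that quotes such attribute values before parsing, assuming Python 3 and the standard xml.etree.ElementTree module. The helper names quote_attributes and parse_sgmlish are made up for illustration, and the regular expression only handles a single unquoted attribute per start tag.

```python
import re
import xml.etree.ElementTree as ET

# Matches <TAG NAME=VALUE where VALUE is unquoted, e.g. <S ID=2566>.
# Assumption: at most one attribute per start tag, as in the sample data.
_UNQUOTED_ATTR = re.compile(r'(<\w+\s+\w+)=([^\s"\'<>]+)(?=[\s>])')


def quote_attributes(text):
    """Wrap unquoted attribute values in double quotes."""
    return _UNQUOTED_ATTR.sub(r'\1="\2"', text)


def parse_sgmlish(data, encoding):
    """Decode bytes, repair unquoted attributes, and parse as UTF-8 XML."""
    text = quote_attributes(data.decode(encoding))
    return ET.fromstring(text.encode("utf-8"))
```

Usage would be something like parse_sgmlish(open(filename, "rb").read(), "gbk"). If the file contains other SGML-style constructs, this fix alone will not be enough.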

STeVe