elementtree and gbk encoding

Tue Mar 14 17:10:55 EST 2006

Diez B. Roggisch wrote:
> Steven Bethard schrieb:
>> I'm having trouble using elementtree with an XML file that has some 
>> gbk-encoded text.  (I can't read Chinese, so I'm taking their word for 
>> it that it's gbk-encoded.)  I always have trouble with encodings, so 
>> I'm sure I'm just screwing something simple up.  Can anyone help me?
>>
>> Here's the interactive session.  Sorry it's a little verbose, but I 
>> figured it would be better to include too much than not enough.  I 
>> basically expected et.ElementTree(file=...) to fail since no encoding 
>> was specified, but I don't know what I'm doing wrong when I use 
>> codecs.open(...)
> 
> The first and most important lesson to learn here is that well-formed 
> XML must contain a xml-header that specifies the used encoding. This has 
> two consequences for you:
> 
>  1) all xml-parsers expect byte-strings, as they have to first read the 
> header to know what encoding awaits them. So no use reading the xml-file 
> with a codec - even if it is the right one. It will get converted back 
> to a string when fed to the parser, with the default codec being used - 
> resulting in  the well-known unicode error.
> 
>  2) your xml is _not_ well-formed, as it doesn't contain a xml-header! 
> You need to ask these guys to deliver the xml with header. Of course for 
> now it is ok to just prepend the text with something like <?xml 
> version="1.0" encoding="gbk"?>. But I'd still request them to deliver it 
> with that header - otherwise it is _not_ XML, but just something that 
> happens to look similar and doesn't guarantee to be well-formed and thus 
> can be safely fed to a parser.
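
To illustrate point 1, here is a minimal sketch of handing the parser raw bytes and letting it read the encoding from the declaration. It assumes Python 3 and the standard-library xml.etree.ElementTree rather than the standalone elementtree package used in this thread. The sample document and the iso-8859-1 encoding are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an ISO-8859-1 byte string that carries its own
# encoding declaration, so the parser can work out how to decode it.
data = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'.encode('iso-8859-1')

# The parser is given bytes, not an already-decoded string.
elem = ET.fromstring(data)
print(repr(elem.text))
```

The element text comes back as a decoded string, with no explicit decode step on the caller's side.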

Thanks, that's very helpful.  I'll definitely harass the people 
producing these files to make sure they put encoding declarations in them.

Here's what I get with the prepending hack:

 >>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' +
 ...               open(filename).read())
Traceback (most recent call last):
   File "<interactive input>", line 1, in ?
   File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
     parser.feed(text)
   File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
     self._parser.Parse(data, 0)
ExpatError: unknown encoding: line 1, column 30

Are the XML encoding names different from the Python ones?  The "gbk" 
encoding seems to work okay from Python:

 >>> open(filename).read().decode('gbk')
u'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n 
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN 
(LCP-TMP (IP (NP-PN-SBJ (NR \u4f0f\u660e\u971e)) \n\t\t       (VP (VV 
\u83b7\u5f97) \n\t\t\t   (NP-OBJ (NN \u5973\u5b50) \n\t\t\t\t   (NN 
\u8df3\u53f0) \n\t\t\t\t   (NN \u8df3\u6c34) \n\t\t\t\t   (NN 
\u51a0\u519b)))) \n\t\t   (LC \u540e)) \n          (PU \uff0c) \n 
    (NP-SBJ (NP-PN (NR \u82cf\u8054\u961f)) \n                  (NP (NN 
\u6559\u7ec3))) \n          (VP (ADVP (AD \u70ed\u60c5)) \n 
  (PP-DIR (P \u5411) \n\t\t      (NP (PN \u5979))) \n              (VP 
(VV \u795d\u8d3a))) \n          (PU \u3002)) ) \n</S>\n<S ID=2567>\n( 
(FRAG  (NR \u65b0\u534e\u793e) \n         (NN \u8bb0\u8005) \n 
(NR \u7a0b\u81f3\u5584) \n         (VV \u6444) )) 
\n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
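
One possible workaround, since Python can decode the text even though the parser rejects the "gbk" declaration: decode the bytes in Python, re-encode them as UTF-8, and parse that. Expat itself only knows a handful of encodings (UTF-8, UTF-16, ISO-8859-1, US-ASCII), and UTF-8 is what a parser assumes when there is no declaration. The sketch below assumes Python 3 and the standard-library xml.etree.ElementTree. The helper name parse_gbk_xml and the shortened sample document are hypothetical. Note that the sample quotes the ID attribute; the ID=2566 in the output above is unquoted, which an XML parser would reject regardless of encoding.

```python
import xml.etree.ElementTree as ET

def parse_gbk_xml(raw_bytes):
    """Parse GBK-encoded XML bytes that carry no XML declaration.

    Hypothetical helper: decode with Python's gbk codec, then re-encode
    as UTF-8, which the parser assumes when no encoding is declared.
    """
    text = raw_bytes.decode('gbk')
    return ET.fromstring(text.encode('utf-8'))

# Shortened stand-in for the real file, with the attribute value quoted.
sample = '<DOC><DOCID>ART242</DOCID><S ID="2566">\u4f0f\u660e\u971e</S></DOC>'.encode('gbk')

root = parse_gbk_xml(sample)
print(root.find('DOCID').text)
print(ascii(root.find('S').text))
```

This sidesteps the declaration entirely, so it only helps when the real encoding is already known from some other source.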

STeve