elementtree and gbk encoding

Diez B. Roggisch deets at nospam.web.de
Tue Mar 14 16:57:45 EST 2006


Steven Bethard wrote:
> I'm having trouble using elementtree with an XML file that has some 
> gbk-encoded text.  (I can't read Chinese, so I'm taking their word for 
> it that it's gbk-encoded.)  I always have trouble with encodings, so I'm 
> sure I'm just screwing something simple up.  Can anyone help me?
> 
> Here's the interactive session.  Sorry it's a little verbose, but I 
> figured it would be better to include too much than not enough.  I 
> basically expected et.ElementTree(file=...) to fail since no encoding 
> was specified, but I don't know what I'm doing wrong when I use 
> codecs.open(...)

The first and most important lesson to learn here is that XML which is 
not encoded in UTF-8 or UTF-16 must start with an XML declaration (the 
xml-header) that specifies the encoding used. This has two consequences 
for you:

  1) XML parsers expect byte strings, as they first have to read the 
header to know what encoding awaits them. So there is no use reading the 
XML file with a codec, even if it is the right one. The decoded text 
gets converted back to a byte string when fed to the parser, with the 
default codec being used, which results in the well-known unicode error.

  2) your XML is _not_ well-formed, as it is gbk-encoded but doesn't 
contain an xml-header saying so! You need to ask these guys to deliver 
the XML with the header. Of course, for now it is OK to just prepend the 
text with something like <?xml version="1.0" encoding="gbk"?>. But I'd 
still request that they deliver it with that header. Otherwise it is 
_not_ XML, but just something that happens to look similar, is not 
guaranteed to be well-formed, and thus cannot be safely fed to a parser.
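
A minimal sketch of the workaround from point 2, assuming a current 
Python 3 and the xml.etree.ElementTree module from the standard library. 
The helper names with_declaration and parse_headerless and the sample 
bytes are made up for illustration. The fallback branch is my own 
assumption, not something from the original session: some parser builds 
may refuse multi-byte encodings such as gbk, in which case the bytes are 
transcoded to UTF-8, which needs no header.

```python
import xml.etree.ElementTree as ET


def with_declaration(raw, encoding):
    """Return raw XML bytes, prepending an XML declaration if none is present."""
    if raw.startswith(b"<?xml"):
        return raw
    header = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    return header.encode("ascii") + raw


def parse_headerless(raw, encoding):
    """Parse header-less XML bytes whose encoding is known out of band."""
    try:
        # Hand the parser bytes and let it read the encoding from the header.
        return ET.fromstring(with_declaration(raw, encoding))
    except (ValueError, LookupError, ET.ParseError):
        # Assumption: the parser may not support this encoding itself.
        # Transcode the header-less bytes to UTF-8, which needs no declaration.
        return ET.fromstring(raw.decode(encoding).encode("utf-8"))


# Made-up sample standing in for the real file contents (read in binary mode).
sample = "<doc>\u4e2d\u6587</doc>".encode("gbk")
root = parse_headerless(sample, "gbk")
print(root.tag, root.text == "\u4e2d\u6587")
```

For a real file, read it with open(path, "rb").read() so the parser 
gets the raw bytes, and pass those to parse_headerless together with the 
encoding the supplier claims to have used.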


HTH Diez


