elementtree and gbk encoding
Diez B. Roggisch
deets at nospam.web.de
Tue Mar 14 16:57:45 EST 2006
Steven Bethard wrote:
> I'm having trouble using elementtree with an XML file that has some
> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
> it that it's gbk-encoded.) I always have trouble with encodings, so I'm
> sure I'm just screwing something simple up. Can anyone help me?
>
> Here's the interactive session. Sorry it's a little verbose, but I
> figured it would be better to include too much than not enough. I
> basically expected et.ElementTree(file=...) to fail since no encoding
> was specified, but I don't know what I'm doing wrong when I use
> codecs.open(...)
The first and most important lesson to learn here is that XML in any
encoding other than UTF-8 or UTF-16 must start with an xml-header (the
XML declaration) that names the encoding used. This has two
consequences for you:
1) all xml-parsers expect byte-strings, as they first have to read the
header to know which encoding awaits them. So there is no use in reading
the xml-file with a codec, even if it is the right one. The resulting
unicode object gets converted back to a byte-string when fed to the
parser, with the default codec (ascii) being used, resulting in the
well-known UnicodeEncodeError.
2) your xml is _not_ well-formed, as it doesn't contain an xml-header!
You need to ask these guys to deliver the xml with a header. Of course
for now it is ok to just prepend the text with something like <?xml
version="1.0" encoding="gbk"?>. But I'd still request them to deliver it
with that header - otherwise it is _not_ XML, but just something that
happens to look similar, is not guaranteed to be well-formed, and thus
cannot be safely fed to a parser.
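For illustration, here is a rough sketch of the workaround for a
header-less file, written for a current Python 3 with
xml.etree.ElementTree. It is not taken from the original session: the
function name parse_headerless_gbk and the sample data are made up, and
it assumes the supplier's claim of gbk is correct. Whether a given
parser accepts a gbk header directly depends on its codec support, so
instead of prepending encoding="gbk" this sketch transcodes the bytes to
UTF-8 and prepends a matching header. Either way, the parser is handed
bytes, not decoded text.

```python
import xml.etree.ElementTree as ET


def parse_headerless_gbk(raw: bytes) -> ET.Element:
    """Parse XML bytes that are gbk-encoded but carry no xml-header.

    Hypothetical helper; "gbk" is the encoding claimed by the supplier.
    """
    # Decode with the claimed encoding; a wrong claim raises UnicodeDecodeError.
    text = raw.decode("gbk")
    # Re-encode as UTF-8 and prepend a header that matches the new bytes.
    header = b'<?xml version="1.0" encoding="utf-8"?>\n'
    declared = header + text.encode("utf-8")
    # Feed bytes to the parser so it can read the header itself.
    return ET.fromstring(declared)


# Made-up sample standing in for the supplier's file content
# ("\u4e2d\u6587" is the two-character word for "Chinese").
sample = "<doc><w>\u4e2d\u6587</w></doc>".encode("gbk")
root = parse_headerless_gbk(sample)
word = root.find("w").text
```

For a real file, the bytes would come from open(path, "rb").read()
rather than from codecs.open(...), so that no decoding happens before
the helper sees the data.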
HTH Diez