[XML-SIG] parsing chinese characters

Stefan Behnel stefan_ml at behnel.de
Tue Oct 23 08:26:58 CEST 2007


Fabian López wrote:
> I am parsing an XML file that includes Chinese characters, like
> 評評啖啖才是眞.細氺長锍才是愛 or ヘアアイロン... The problem is that I get an error like:
> UnicodeEncodeError: 'charmap' codec can't encode characters in position....
> The thing is that I would like to ignore them and parse all the characters
> except these ones. So, could anyone help me? I suppose that I can catch an
> exception and ignore it, or maybe use a function that detects these
> Chinese characters and then ignores them.

If the parser can't handle the characters here, it's because the document is
broken and does not declare the correct encoding.

From your last post, I assume you're using lxml to do this (it's always
helpful to state what software you use when you describe a problem with it).
Since 2.0alpha3(?), you can override the encoding of the parsed file with the
"encoding" keyword that you can pass to the XMLParser class. So, for example,
you can try parsing the document as usual and if that fails, try parsing it
with a different parser that is configured for a specific encoding override.
Or you can determine the encoding based on some external source (like what the
HTTP protocol tells you), and then use an override parser right away, or use
that information as the first fallback.
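
A minimal sketch of the "parse as usual, then retry with an override" idea
might look like the following. The function name parse_with_fallback and the
default fallback encoding "iso-8859-1" are assumptions made up for this
illustration; they are not part of lxml. The import falls back to the standard
library's ElementTree (whose XMLParser accepts the same "encoding" keyword)
only so that the sketch runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library, whose XMLParser
    # also accepts an "encoding" override keyword.
    import xml.etree.ElementTree as etree


def parse_with_fallback(data, fallback_encodings=("iso-8859-1",)):
    """Parse XML bytes as declared; on failure, retry with encoding overrides.

    `fallback_encodings` is a hypothetical parameter: an ordered list of
    encodings to try, e.g. with the charset reported by HTTP placed first.
    """
    try:
        # First attempt: trust whatever encoding the document declares.
        return etree.fromstring(data)
    except etree.ParseError as first_error:
        for encoding in fallback_encodings:
            try:
                # Second attempt: a parser that ignores the declaration
                # and decodes the input with the given encoding instead.
                parser = etree.XMLParser(encoding=encoding)
                return etree.fromstring(data, parser)
            except etree.ParseError:
                continue
        # No override worked, so report the original failure.
        raise first_error


# Latin-1 bytes with no encoding declaration: the default parse fails
# because the input is not valid UTF-8, and the override parser is used.
broken = "<root>caf\u00e9</root>".encode("iso-8859-1")
root = parse_with_fallback(broken)
```

If the encoding is known from an external source, it can be passed as the
first entry of fallback_encodings, or used to build the override parser
directly without attempting the default parse at all.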

Stefan

