ElementTree and Unicode
"Martin v. Löwis"
martin at v.loewis.de
Wed Aug 2 15:23:59 EDT 2006
Sébastien Boisgérault wrote:
> I am trying to embed *arbitrary* (unicode) strings inside
> an XML document. Of course I'd like to be able to reconstruct
> it later from the xml document ... If the naive way to do it does
> not work, can anyone suggest a way to do it ?
XML does not support arbitrary Unicode characters; most control
characters below U+0020 are excluded, as are surrogates, U+FFFE,
and U+FFFF. See the definition of Char in
http://www.w3.org/TR/2004/REC-xml-20040204
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]
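A minimal sketch of checking a string against the Char production above, in Python 3. The names is_xml_safe and illegal_chars are made up for illustration; the character class is just the production negated.

```python
import re

# Matches any single character that is NOT allowed by the Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text):
    """Return True if every character of text matches Char (hypothetical helper)."""
    return _NOT_XML_CHAR.search(text) is None


def illegal_chars(text):
    """Return the characters of text that fall outside Char, in order."""
    return _NOT_XML_CHAR.findall(text)
```

A string for which is_xml_safe returns False cannot appear literally in a well-formed XML 1.0 document, whatever encoding the document uses.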
Now, one might think you could use a character reference
to refer to the "missing" characters, but this is not so:
[66] CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
Well-formedness constraint: Legal Character
Characters referred to using character references must match the
production for Char.
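The constraint can be observed with a conforming parser. The following is a small sketch using the expat-based parser behind xml.etree.ElementTree in Python 3; the helper name parses is an assumption for illustration, and other parsers may report the error differently.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if document is accepted by the parser (hypothetical helper)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # Raised for any well-formedness error, including a character
        # reference that names a character outside the Char production.
        return False
    return True


# &#9; refers to TAB, which is in Char; &#1; refers to U+0001, which is not.
tab_ok = parses("<a>&#9;</a>")
ctrl_ok = parses("<a>&#1;</a>")
```

Here tab_ok should come out true and ctrl_ok false, since a reference to U+0001 violates the Legal Character constraint.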
As others have explained, if you want to transmit arbitrary characters,
you need to encode them as text in some way. One obvious solution
would be to encode the Unicode data as UTF-8 first, and then encode
the UTF-8 bytes using base64. The receiver of the XML document then
must do the reverse.
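The round trip described above could look roughly like this in Python 3 with ElementTree. The function names embed and extract and the element name "data" are assumptions chosen for the example, not part of any API.

```python
import base64
import xml.etree.ElementTree as ET


def embed(text):
    """Sender side: UTF-8 encode, then base64, then wrap in an element."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    elem = ET.Element("data")  # hypothetical element name
    elem.text = payload        # base64 output is plain ASCII, always legal in XML
    return ET.tostring(elem, encoding="unicode")


def extract(xml_doc):
    """Receiver side: the reverse, base64 decode and then UTF-8 decode."""
    elem = ET.fromstring(xml_doc)
    # An empty payload is serialized as an empty element, whose text is None.
    return base64.b64decode(elem.text or "").decode("utf-8")


original = "null \x00, bell \x07, noncharacter \uffff, and S\u00e9bastien"
assert extract(embed(original)) == original
```

One caveat of this sketch: a Python string containing lone surrogates cannot be encoded as UTF-8 with the default error handler, so such strings would need separate treatment.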
Regards,
Martin