ElementTree and Unicode

"Martin v. Löwis" martin at v.loewis.de
Wed Aug 2 15:23:59 EDT 2006


Sébastien Boisgérault wrote:
> I am trying to embed an *arbitrary* (unicode) string inside
> an XML document. Of course I'd like to be able to reconstruct
> it later from the xml document ... If the naive way to do it does
> not work, can anyone suggest a way to do it?

XML does not support arbitrary Unicode characters: most control
characters are excluded, as are the surrogate code points and
#xFFFE/#xFFFF. See the definition of Char in

http://www.w3.org/TR/2004/REC-xml-20040204

[2]   Char    ::=  #x9 | #xA | #xD | [#x20-#xD7FF]
                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
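
To make the production concrete, here is a rough Python sketch of the same
test. It assumes Python 3 (where str is a sequence of code points); the names
is_xml_char and is_xml_text are made up for illustration and are not part of
ElementTree.

```python
def is_xml_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD       # skips surrogates, #xFFFE and #xFFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )


def is_xml_text(s):
    """Return True if every character of s may appear in an XML 1.0 document."""
    return all(is_xml_char(ch) for ch in s)
```

A string for which is_xml_text returns False cannot be embedded literally,
whatever encoding the document uses.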

Now, one might think you could use a character reference
(e.g. &#0;) to refer to the "missing" characters, but this is not so:


[66]  CharRef ::=  '&#' [0-9]+ ';'
                 | '&#x' [0-9a-fA-F]+ ';'

    Well-formedness constraint: Legal Character
    Characters referred to using character references must match the
    production for Char.
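
You can see this constraint enforced by a conforming parser. The sketch
below assumes Python 3 and the standard xml.etree.ElementTree module (which
uses expat underneath); the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if document is accepted by the parser as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A reference to TAB (#x9) is legal because #x9 matches Char.
tab_ok = parses("<a>&#9;</a>")

# A reference to NUL (#x0) violates the Legal Character constraint.
nul_ok = parses("<a>&#0;</a>")
```

Here tab_ok should come out True and nul_ok False, since the parser must
reject the second document rather than hand back a NUL character.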

As others have explained, if you want to transmit arbitrary characters,
you need to encode them as text in some way. One obvious solution
would be to encode the Unicode data as UTF-8 first, and then encode
the UTF-8 bytes using base64. The receiver of the XML document must
then do the reverse.
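
A minimal sketch of that round trip, assuming Python 3 and the standard
base64 and xml.etree.ElementTree modules. The function names embed and
extract and the element name "data" are placeholders chosen for this
example, not anything ElementTree prescribes.

```python
import base64
import xml.etree.ElementTree as ET


def embed(text):
    """Return an XML document carrying text as base64-encoded UTF-8."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    elem = ET.Element("data")
    elem.text = payload  # base64 output is plain ASCII, so always legal XML
    return ET.tostring(elem, encoding="unicode")


def extract(document):
    """Reverse embed(): parse the document and decode the payload."""
    elem = ET.fromstring(document)
    # An empty element parses with text set to None, hence the fallback.
    return base64.b64decode(elem.text or "").decode("utf-8")


original = "NUL:\x00 BEL:\x07 noncharacter:\ufffe"
restored = extract(embed(original))
```

The price is that the payload is no longer human-readable in the document
and grows by roughly a third, but restored compares equal to original even
though the string contains characters that XML itself forbids.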

Regards,
Martin


