ElementTree and Unicode

Sébastien Boisgérault Sebastien.Boisgerault at gmail.com
Wed Aug 2 18:58:57 EDT 2006


Martin v. Löwis wrote:
> Sébastien Boisgérault schrieb:
> > I am trying to embed *arbitrary* (unicode) strings inside
> > an XML document. Of course I'd like to be able to reconstruct
> > it later from the xml document ... If the naive way to do it does
> > not work, can anyone suggest a way to do it ?
>
> XML does not support arbitrary Unicode characters; a few control
> characters are excluded. See the definition of Char in
>
> http://www.w3.org/TR/2004/REC-xml-20040204
>
> [2]     Char     ::=     #x9 | #xA | #xD | [#x20-#xD7FF] |
> [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
> Now, one might think you could use a character reference
> (e.g. &#0;) to refer to the "missing" characters, but this is not so:
>
>
> [66]  CharRef ::=  '&#' [0-9]+ ';'
>                  | '&#x' [0-9a-fA-F]+ ';'
>
>     Well-formedness constraint: Legal Character
>     Characters referred to using character references must match the
>     production for Char.
>
> As others have explained, if you want to transmit arbitrary characters,
> you need to encode them as text in some way. One obvious solution
> would be to encode the Unicode data as UTF-8 first, and then encode
> the UTF-8 bytes using base64. The receiver of the XML document then
> must do the reverse.
>
> Regards,
> Martin

OK ! Thanks a lot for this helpful information.
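
For the archive, here is a minimal sketch of the scheme Martin describes,
written for Python 3. The names is_xml_safe, embed and extract, the <data>
element and its "encoding" attribute are all made up for illustration; only
the Char production and the UTF-8-then-base64 idea come from the message
above. It assumes the string holds no lone surrogates, which UTF-8 cannot
encode.

```python
import base64
import re
import xml.etree.ElementTree as ET

# Matches any character outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHAR = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text):
    """Return True if every character of text matches the Char production."""
    return _ILLEGAL_XML_CHAR.search(text) is None


def embed(text):
    """Return an XML document (bytes) carrying text as base64 of its UTF-8 form."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # Hypothetical element and attribute names, chosen for this sketch only.
    elem = ET.Element("data", encoding="base64")
    elem.text = payload
    return ET.tostring(elem)


def extract(xml_bytes):
    """Reverse embed(): parse the document, then undo base64 and UTF-8."""
    elem = ET.fromstring(xml_bytes)
    return base64.b64decode(elem.text).decode("utf-8")
```

A string such as "a\x00b" fails is_xml_safe, yet extract(embed("a\x00b"))
gives it back unchanged, because only base64 characters ever reach the XML
layer.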

Cheers,

SB



