ElementTree and Unicode
Sébastien Boisgérault
Sebastien.Boisgerault at gmail.com
Wed Aug 2 18:58:57 EDT 2006
Martin v. Löwis wrote:
> Sébastien Boisgérault schrieb:
> > I am trying to embed *arbitrary* (unicode) strings inside
> > an XML document. Of course I'd like to be able to reconstruct
> > it later from the xml document ... If the naive way to do it does
> > not work, can anyone suggest a way to do it ?
>
> XML does not support arbitrary Unicode characters; a few control
> characters are excluded. See the definition of Char in
>
> http://www.w3.org/TR/2004/REC-xml-20040204
>
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
> [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
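The Char production above translates directly into a check that can be run before a string is written into an XML document. A minimal sketch, assuming Python 3 and only the standard-library re module; the name is_xml_safe is hypothetical and not part of ElementTree:

```python
import re

# Matches any character that is NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text):
    """Return True if every character of text matches the Char production."""
    return _INVALID_XML_CHAR.search(text) is None


if __name__ == "__main__":
    for sample in ["S\u00e9bastien", "tab\tand\nnewline", "nul\x00byte", "\x0b"]:
        print(repr(sample), is_xml_safe(sample))
```

Strings for which this returns False cannot be stored verbatim in XML 1.0 and need some extra encoding layer.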
> Now, one might think you could use a character reference
> (e.g. &#0;) to refer to the "missing" characters, but this is not so:
>
>
> [66] CharRef ::= '&#' [0-9]+ ';'
> | '&#x' [0-9a-fA-F]+ ';'
>
> Well-formedness constraint: Legal Character
> Characters referred to using character references must match the
> production for Char.
>
> As others have explained, if you want to transmit arbitrary characters,
> you need to encode them as text in some way. One obvious solution
> would be to encode the Unicode data as UTF-8 first, and then encode
> the UTF-8 bytes using base64. The receiver of the XML document then
> must do the reverse.
>
> Regards,
> Martin
OK ! Thanks a lot for this helpful information.
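For the archive, here is a minimal sketch of the round trip Martin describes (UTF-8 first, then base64, and the reverse on the receiving side). It assumes Python 3 and the standard-library base64 and xml.etree.ElementTree modules; the helper names embed and extract and the tag name "data" are hypothetical choices for illustration:

```python
import base64
import xml.etree.ElementTree as ET


def embed(text, tag="data"):
    """Return an element whose text is the base64 form of text's UTF-8 bytes."""
    elem = ET.Element(tag)
    elem.text = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return elem


def extract(elem):
    """Reverse embed(): base64-decode the element text, then decode UTF-8."""
    # An empty payload is serialized as an empty element, whose text is None.
    return base64.b64decode(elem.text or "").decode("utf-8")


if __name__ == "__main__":
    # Contains control characters that the Char production excludes.
    original = "S\u00e9bastien\x00\x01\x0b"
    document = ET.tostring(embed(original))      # serialized XML, as bytes
    restored = extract(ET.fromstring(document))  # parse it back and decode
    assert restored == original
```

The base64 alphabet only uses characters that are legal in XML, so the serialized document stays well-formed whatever the original string contains.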
Cheers,
SB