[XML-SIG] XML Unicode and UTF-8

Mike Brown mike at skew.org
Thu Aug 5 22:27:29 CEST 2004


Paul Boddie wrote:
> Do this instead:
> 
>       utext = segment[0].decode( segment[1] )

The resulting Unicode object may contain characters that are not allowed in 
XML (most control characters below U+0020, for example), and thus the text 
may not be serializable (at least not in a way that would produce well-formed 
XML).
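
As a rough sketch of how one might check for this, the following tests a 
decoded string against the Char production of XML 1.0. The function name 
find_illegal_xml_chars and the sample bytes are made up for illustration, and 
the code assumes Python 3, where bytes.decode() returns a str:

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (index, character) pairs for characters XML 1.0 forbids.

    Hypothetical helper, not part of any library.
    """
    return [(m.start(), m.group()) for m in _ILLEGAL_XML_CHARS.finditer(text)]


# Illustrative input: Latin-1 maps every byte to a code point, so decoding
# always succeeds, but NUL and ESC are still not legal XML characters.
utext = b"ok\x00\x1bdone".decode("latin-1")
print(find_illegal_xml_chars(utext))
```

An empty list means every character can appear in an XML 1.0 document; it 
says nothing about whether the chosen output encoding can represent them.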

To embed arbitrary bytes in XML, the usual advice is to first convert the 
bytes into a character sequence that is permitted in XML. Base64 is a popular 
and easily implemented option, albeit inefficient (every 3 bytes become 4 
characters). The article at 
http://www.javaworld.com/javaworld/javatips/jw-javatip117-p2.html suggests 
that a custom Huffman implementation is nearly 1:1. I've mapped bytes into the 
Private Use Area of Unicode before, too, although that's definitely not 
efficient.
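
To make two of these options concrete, here is a minimal sketch in Python 3. 
The Base64 part uses the standard base64 module. The Private Use Area part is 
only one possible mapping: the offset PUA_BASE and all four function names are 
assumptions chosen for illustration, not the scheme referred to above.

```python
import base64

# Assumption: map byte value b to U+E000 + b, inside the BMP Private Use
# Area (U+E000..U+F8FF). Any other 256-character PUA window would do.
PUA_BASE = 0xE000


def bytes_to_base64_text(data):
    """Encode bytes as ASCII text that is safe to place in XML content."""
    return base64.b64encode(data).decode("ascii")


def base64_text_to_bytes(text):
    """Recover the original bytes from Base64 text."""
    return base64.b64decode(text)


def bytes_to_pua(data):
    """Map each byte to one Private Use Area character (hypothetical scheme)."""
    return "".join(chr(PUA_BASE + b) for b in data)


def pua_to_bytes(text):
    """Invert bytes_to_pua; raises ValueError if a character is out of range."""
    return bytes(ord(ch) - PUA_BASE for ch in text)


raw = bytes(range(256))  # every possible byte value, including NUL

b64 = bytes_to_base64_text(raw)
pua = bytes_to_pua(raw)

# Compare the size of each representation once serialized as UTF-8.
print(len(raw), len(b64.encode("utf-8")), len(pua.encode("utf-8")))
```

Both round-trip losslessly, but each character in this PUA window takes three 
bytes in UTF-8, so the PUA form comes out larger than the Base64 form.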

