[I18n-sig] XML and UTF-16

M.-A. Lemburg mal@lemburg.com
Thu, 31 May 2001 19:39:17 +0200


Tom Emerson wrote:
> 
> M.-A. Lemburg writes:
> > What is the standard file layout to use for storing an XML file
> > in UTF-16 ?
> 
> I thought this was covered in the XML specification as a non-normative
> appendix. Maybe not.

I was too lazy to look it up :-)
 
> > 1) encode the whole file in UTF-16 (possibly prepended with a BOM)
> 
> Yes. You can then pretty easily autodetect which Unicode
> transformation format is being used by looking at the first ten or
> so bytes.
> 
> If the BOM is present, that's a big clue right there.
> 
> UTF-16-BE will have the first "<?xml" encoded like
> 
> 003C 003F 0078 006D 006C
> 
> while UTF-16-LE will have it encoded as
> 
> 3C00 3F00 7800 6D00 6C00
> 
> ASCII and UTF-8 will just have
> 
> 3C 3F 78 6D 6C

Perhaps we should have some smart auto-detection API somewhere
which does this automagically ?! Something like

	guess_xml_encoding(data) -> encoding string

It could work by looking at the first 256 bytes of the data
string and then applying all the tricks needed to extract the
encoding information (or defaulting to UTF-8 if no such information
is given).
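
A minimal sketch of what such a function might look like, written
for present-day Python 3 with byte strings. Only the name
guess_xml_encoding and the 256-byte window come from the proposal
above; the helper names _BOMS and _DECL_RE, the order of the checks
and the lower-cased return values are assumptions made for
illustration. It only covers UTF-8, UTF-16 and ASCII-compatible
encodings named in the XML declaration, not UTF-32 or EBCDIC.

```python
import codecs
import re

# Hypothetical helpers, not part of any existing library.
# Byte order marks that identify the encoding on their own.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches encoding="..." inside an XML declaration at the very start
# of an ASCII-compatible byte stream.
_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data):
    """Return a guessed encoding name for the XML document in data (bytes)."""
    head = data[:256]

    # 1. A BOM, if present, is the strongest clue. It is not stripped
    #    here; the caller has to skip it before decoding.
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name

    # 2. No BOM: look for "<?" encoded as two UTF-16 code units.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"

    # 3. ASCII-compatible start: read the declared encoding, if any.
    match = _DECL_RE.match(head)
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. Nothing found: fall back to the default, UTF-8.
    return "utf-8"
```

For example, guess_xml_encoding(b"<a/>") falls through all checks
and returns the default, while a document starting with 00 3C 00 3F
is reported as big-endian UTF-16.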

> > 2) write the first line containing the XML header (which has the
> >    encoding information) in ASCII and then proceed with UTF-16
> >    starting after the newline character
> 
> Ugh, no.

Thought so :-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/