[I18n-sig] XML and UTF-16

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 31 May 2001 22:46:31 +0200


> Yes, I think this would be a good idea. I would use something along
> the lines of:

Please have a look at
xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost
follows the procedure in the XML recommendation, except that it does
not expect "unusual" byte orders (2143, 3412), and that it does not
detect EBCDIC.
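
For reference, here is a rough sketch of the signature check described
in the recommendation's autodetection appendix, including the unusual
byte orders and EBCDIC. It is not the xmlproc code: the function name
guess_family and the labels "ucs-4-2143", "ucs-4-3412",
"ascii-compatible" and "ebcdic" are made up for illustration and are
not Python codec names.

```python
# Sketch only: names and labels here are hypothetical, not xmlproc's API.

# Byte order marks. The four-byte patterns come first so that
# FF FE 00 00 is not mistaken for a UTF-16 little-endian BOM.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),   # UCS-4, order 1234
    (b"\xff\xfe\x00\x00", "utf-32-le"),   # UCS-4, order 4321
    (b"\x00\x00\xff\xfe", "ucs-4-2143"),  # UCS-4, unusual order
    (b"\xfe\xff\x00\x00", "ucs-4-3412"),  # UCS-4, unusual order
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# No BOM: look at how the leading '<' (and '?') is laid out.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00\x00<\x00", "ucs-4-2143"),
    (b"\x00<\x00\x00", "ucs-4-3412"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii-compatible"),        # UTF-8, ISO 8859-x, ...
    (b"\x4c\x6f\xa7\x94", "ebcdic"),      # '<?xm' in EBCDIC
]


def guess_family(data):
    """Return (label, bom_length) for the first bytes of an entity."""
    for signature, label in _BOMS:
        if data.startswith(signature):
            return label, len(signature)
    for signature, label in _NO_BOM:
        if data.startswith(signature):
            return label, 0
    # Nothing recognised: the default for XML is UTF-8.
    return "utf-8", 0
```

The result is only a first guess at the encoding family; the encoding
declaration, if present, still has to be read afterwards.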

> 0) Assume UTF-8.
> 
> 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the
>    appropriate transmission format and endian nature. Goto 4.
> 
> 2) Look for the UTF-8 uniBOM, since some editors like putting that in.
>    Ignore it and goto 4.

I see this was added to the XML recommendation only in the second
edition, so it should also be added to xmlproc.

> 3) Look for the sundry forms of '<?xml ' in ASCII, UTF-16, and UTF-32,
>    with appropriate endian variants. If found, assume the detected
>    encoding. Goto 4.

Please note that ASCII is not detectable this way: If you see '<?xml',
then you don't know anything about the encoding except that you should
be able to parse the encoding= attribute successfully if present.
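
To illustrate: once the first bytes look like '<?xml' in some
ASCII-compatible encoding, the actual encoding has to come from the
declaration itself. A minimal sketch, assuming the entity is available
as a byte string; the helper name declared_encoding and the regular
expression are my own and not part of xmlproc:

```python
import re

# Hypothetical helper, for illustration only. The pattern accepts an
# optional version pseudo-attribute followed by an optional encoding one.
_DECL = re.compile(
    rb"""<\?xml(?:\s+version\s*=\s*(?:"[^"]*"|'[^']*'))?"""
    rb"""(?:\s+encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1)?"""
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the declaration, else the default."""
    match = _DECL.match(data)
    if match is None or match.group(2) is None:
        # No declaration, or no encoding= in it: XML falls back to UTF-8.
        return default
    # Encoding names are restricted to ASCII characters.
    return match.group(2).decode("ascii")
```

So '<?xml version="1.0" encoding="ISO-8859-1"?>' gives "ISO-8859-1",
while the same bytes without encoding= give the UTF-8 default.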


Regards,
Martin