[I18n-sig] XML and UTF-16
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Thu, 31 May 2001 22:46:31 +0200
> Yes, I think this would be a good idea. I would use something along
> the lines of:
Please have a look at
xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost
follows the procedure in the XML recommendation, except that it does
not expect "unusual" byte orders (2143, 3412), and that it does not
detect EBCDIC.
> 0) Assume UTF-8.
>
> 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the
> appropriate transmission format and endian nature. Goto 4.
>
> 2) Look for the UTF-8 uniBOM, since some editors like putting that in.
> Ignore it and goto 4.
I see this was added to the XML recommendation only in the second
edition, so it should also be added to xmlproc.
> 3) Look for the sundry forms of '<?xml ' in ASCII, UTF-16, and UTF-32,
> with appropriate endian variants. If found, assume the detected
> encoding. Goto 4.
Please note that ASCII is not detectable this way: If you see '<?xml',
then you don't know anything about the encoding except that you should
be able to parse the encoding= attribute successfully if present.
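For illustration, here is a rough Python sketch of the detection steps
discussed above. It is not the xmlproc code: the function name
guess_xml_encoding, the two tables, and the returned tuple are made up
for this example. It treats a BOM as a firm answer and a bare '<?xml'
pattern only as a hint about the encoding family, and it ignores the
unusual byte orders and EBCDIC.

```python
import codecs

# Hypothetical tables for this sketch. The UTF-32 BOMs are checked first
# because the UTF-32-LE BOM begins with the same bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Byte patterns of a leading '<' or '<?' when no BOM is present.
_DECL_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]


def guess_xml_encoding(data):
    """Return (encoding, bom_length, from_bom) for the start of a document.

    from_bom is True only when a BOM was found. Otherwise the result is
    just enough to read the XML declaration, whose encoding= value (if
    present) should then decide the real encoding.
    """
    # Steps 1 and 2: look for a BOM and remember how many bytes to skip.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom), True
    # Step 3: look for the wide forms of '<?xml' without a BOM.
    for prefix, name in _DECL_PREFIXES:
        if data.startswith(prefix):
            return name, 0, False
    # Step 0: fall back to UTF-8. A plain b'<?xml' lands here too, since it
    # only shows that the declaration can be read, not which encoding it is.
    return "utf-8", 0, False
```

A caller would skip bom_length bytes, decode just the declaration with
the returned encoding, and switch to whatever encoding= names if that
differs.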
Regards,
Martin