[I18n-sig] XML and UTF-16

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 1 Jun 2001 15:06:11 +0200


> > """Entities encoded in UTF-16 must begin with the Byte Order Mark
> > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC
> > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
> > signature, not part of either the markup or the character data of the
> > XML document. XML processors must be able to use this character to
> > differentiate between UTF-8 and UTF-16 encoded documents."""
> 
> Where did you get that from?

That's from the XML recommendation, section 4.3.3. I really recommend
that you get a copy of that document :-)

> Note that the Unicode specs have a different opinion on this... (a
> BOM mark is part of a protocol and should only be used if the
> encoding information is not available in some other form or
> implicit)

Why is that different? XML says that the BOM is not part of the
document, but an encoding signature. You say that it is part of a
protocol - in the XML case, it is part of the encoding autodetection
protocol.
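
To make the autodetection step concrete, here is a rough Python sketch.
The name sniff_bom is made up for illustration, and the sketch only
looks at the signature; a real XML processor also has to consider the
encoding declaration and the BOM-less cases.

```python
import codecs

def sniff_bom(data):
    """Return (encoding, payload), with a leading BOM removed from payload.

    Hypothetical helper: it only inspects the signature bytes and ignores
    any encoding declaration inside the document.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", data[len(codecs.BOM_UTF8):]
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
    # Limitation: a UTF-32 little-endian BOM starts with these same two
    # bytes; this sketch does not try to tell the two apart.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
    # Assumption: with no signature and no other information, use UTF-8.
    return "utf-8", data

raw = codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")
encoding, payload = sniff_bom(raw)
text = payload.decode(encoding)
```

The signature is consumed by the detection step, so the decoded text
starts at the first character of the document proper.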

If the character were part of the document, any document containing it
would be ill-formed, since the ZWNBSP is not allowed as the first
character of an XML document (only whitespace and '<' are allowed,
AFAICT).
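
A small Python illustration of that point, assuming the first-character
rule as I stated it above (the helper may_start_document is invented
for this example and is not a well-formedness check):

```python
import codecs

XML_WHITESPACE = " \t\r\n"

def may_start_document(text):
    """True if the first character is '<' or XML whitespace."""
    first = text[:1]
    return first == "<" or (first != "" and first in XML_WHITESPACE)

raw = codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")

# The "utf-16" codec treats the BOM as a signature and consumes it.
as_signature = raw.decode("utf-16")

# The "utf-16-le" codec does not look for a signature, so the BOM
# survives as a U+FEFF character at the start of the text.
as_content = raw.decode("utf-16-le")

ok_when_signature = may_start_document(as_signature)
ok_when_content = may_start_document(as_content)
```

Only the first reading gives a text that can begin an XML document; in
the second, the text starts with U+FEFF instead of '<'.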

Regards,
Martin