[I18n-sig] UTF-8 and BOM
M.-A. Lemburg
mal@lemburg.com
Mon, 21 May 2001 19:02:35 +0200
"Martin v. Loewis" wrote:
>
> > That's hard to implement... how would the codec know where the
> > stream starts -- it only interfaces to the underlying stream
> > using .read() and .write() ?
>
> The stream readers and writers should assume that the first read and
> write operation use the ZWNBSP as the BOM, so they should stop giving
> a byte-order meaning to the BOM once they have seen the first chunk of
> data. That is best implemented by replacing the .encode function with
> utf_16_be/le_encode (as appropriate).
Patches are welcome :-)
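For illustration, the approach suggested above could look roughly like the sketch below. The class name BomOnceWriter is hypothetical, the code assumes a present-day Python 3 where codecs.utf_16_encode, codecs.utf_16_le_encode and codecs.utf_16_be_encode are available, and it is not the actual codec implementation:

```python
import codecs
import io
import sys


class BomOnceWriter:
    """Hypothetical writer: emits a BOM with the first chunk only."""

    def __init__(self, stream):
        self.stream = stream
        # The generic encoder prepends a BOM in native byte order.
        self._encode = codecs.utf_16_encode

    def write(self, text):
        data, _consumed = self._encode(text)
        self.stream.write(data)
        # After the first chunk, switch to the fixed-order encoder so
        # that later U+FEFF characters are written as plain ZWNBSP and
        # no second BOM is prepended.
        if sys.byteorder == "little":
            self._encode = codecs.utf_16_le_encode
        else:
            self._encode = codecs.utf_16_be_encode


buf = io.BytesIO()
writer = BomOnceWriter(buf)
writer.write("a")
writer.write("b")
written = buf.getvalue()

native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
expected = codecs.BOM_UTF16 + "ab".encode(native)
```

Here written and expected should be equal: one BOM up front, then both chunks in native byte order.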
> > Note that this only happens in the UTF-16 codec. All other codecs
> > pass through the BOMs as-is. Perhaps I should modify the UTF-16
> > codec to only remove BOMs when used in UTF-16 mode (without byte
> > order indication) and not in UTF-16-LE/UTF-16-BE mode ?!
>
> You may want to study the RFC just to be sure, but I think this is how
> UTF-16-[BL]E are defined.
According to the Unicode FAQ, BOMs should only be used
where the byte order is not immediately clear. In the case of -LE
and -BE, this information is available, which is why the codecs
don't prepend a BOM.
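A quick way to see the difference on the encoding side, assuming a present-day Python 3 interpreter rather than the codec version under discussion here (the variable names are just for illustration):

```python
import codecs

text = "hi"

generic = text.encode("utf-16")    # byte order not implied by the name
le = text.encode("utf-16-le")      # byte order fixed by the name
be = text.encode("utf-16-be")

boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The generic codec has to say which order it picked, so it prepends a BOM.
generic_has_bom = generic[:2] in boms

# The -LE/-BE codecs emit the payload only.
le_has_bom = le[:2] in boms
be_has_bom = be[:2] in boms
```

With this input, generic_has_bom should come out true and the other two false.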
Ok, I will modify the UTF-16-LE and -BE decoders to not remove
BOMs anymore and fix the UTF-16 decoder to only remove BOMs at
the start of the string. With these changes you should be able
to fix the UTF-16 stream codec to be more RFC compliant.
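The intended decoder behaviour can be checked with a small sketch like this one. It assumes a present-day Python 3, whose codecs behave as described; the variable names are made up for the example:

```python
import codecs

bom = codecs.BOM_UTF16_LE
data = bom + "a".encode("utf-16-le") + bom + "b".encode("utf-16-le")

# UTF-16 mode: the leading BOM selects the byte order and is dropped,
# but a later U+FEFF is kept as an ordinary ZWNBSP character.
generic_decoded = data.decode("utf-16")

# UTF-16-LE mode: the byte order is already known, so nothing is
# stripped and both U+FEFF characters survive.
le_decoded = data.decode("utf-16-le")
```

So generic_decoded keeps only the inner U+FEFF, while le_decoded keeps both.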
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/