[I18n-sig] UTF-8 and BOM

Toby Dickenson tdickenson@geminidataloggers.com
Mon, 21 May 2001 11:06:46 +0100


On Sat, 19 May 2001 11:35:18 -0400, Guido van Rossum
<guido@digicool.com> wrote:

>> The problem with BOMs is that they are supposed to appear at
>> the start of a string.
>
>Taken out of context, this strikes me as nonsense.  Strings in memory
>(Python Unicode strings anyway) have absolutely no need for a byte
>order mark since they are always in the right (native) byte order.

That's true for Unicode strings.

However, a Python plain string containing an encoded Unicode string
(in *any* character encoding) is no different from a file here: it's
just a block-o-bytes.
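
To make the distinction concrete, here is a minimal sketch. It assumes
Python 3 syntax, where the "plain string" above corresponds to a bytes
object, and it uses the "utf-8-sig" codec, which was not part of this
discussion. The variable names are only illustrative.

```python
import codecs

# A Unicode string has no byte order, so it carries no BOM.
text = "hi"

# The generic "utf-16" codec prepends a BOM in the machine's native order.
utf16_bytes = text.encode("utf-16")

# "utf-8-sig" prepends the UTF-8 signature (the encoded form of U+FEFF).
utf8_sig_bytes = text.encode("utf-8-sig")

# The BOM lives in the block of bytes, not in the Unicode string.
has_utf16_bom = utf16_bytes.startswith(codecs.BOM_UTF16)
has_utf8_bom = utf8_sig_bytes.startswith(codecs.BOM_UTF8)

# Decoding a single encoded block consumes the leading BOM again.
roundtrip = utf16_bytes.decode("utf-16")
```

Once decoded, the string no longer contains the mark, which is why code
working on Unicode strings should never need to think about it.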

>it is absurd to
>expect code dealing with *strings* to handle BOMs.

I agree with that, and it is a good reason why the codecs should always
remove them.

"M.-A. Lemburg" <mal@lemburg.com> wrote:

>I'm still unsure whether I should change the UTF-16 decoder
>to only remove the BOM at the start of the stream -- the above
>case where BOMs are inserted due to string concatenation
>is very common (each .write() to a file will produce such
>a BOM mark).
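
The concatenation case can be sketched as follows. This assumes Python 3
and a decoder that only consumes the BOM at the very start of the input;
strip_boms is a hypothetical helper, not something the codecs provide.

```python
# Each chunk is encoded on its own, so each one gets its own BOM.
chunks = ["ab", "cd"]
blob = b"".join(chunk.encode("utf-16") for chunk in chunks)

# Only the first BOM is consumed as a byte order mark; the second one
# survives decoding as the character U+FEFF in the middle of the text.
decoded = blob.decode("utf-16")


def strip_boms(s):
    """Hypothetical helper: drop every U+FEFF left in a decoded string.

    This is lossy if U+FEFF was meant as ZERO WIDTH NO-BREAK SPACE.
    """
    return s.replace("\ufeff", "")


cleaned = strip_boms(decoded)
```

The caveat in the docstring is the crux of the question: a decoder that
removes every U+FEFF cannot tell a stray BOM from an intended character.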


Toby Dickenson
tdickenson@geminidataloggers.com