[I18n-sig] UTF-8 and BOM
Toby Dickenson
tdickenson@geminidataloggers.com
Mon, 21 May 2001 11:06:46 +0100
On Sat, 19 May 2001 11:35:18 -0400, Guido van Rossum
<guido@digicool.com> wrote:
>> The problem with BOMs is that they are supposed to appear at
>> the start of a string.
>
>Taken out of context, this strikes me as nonsense. Strings in memory
>(Python Unicode strings anyway) have absolutely no need for a byte
>order mark since they are always in the right (native) byte order.
That's true for Unicode strings.
However, a Python plain string containing an encoded Unicode string
(in *any* character encoding) is no different to a file here - it's
just a block-o-bytes.
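A minimal sketch of that point, assuming a modern Python 3 interpreter (where the "plain string" of this thread corresponds to a bytes object). The variable names are only illustrative. Encoding to UTF-16 yields a block of bytes that starts with a BOM in the platform's native byte order, and decoding that block consumes the BOM again:

```python
import codecs

text = "hi"

# Encoding produces plain bytes: a BOM followed by native-order code units.
encoded = text.encode("utf-16")

# codecs.BOM_UTF16 is the BOM in the platform's native byte order.
has_bom = encoded.startswith(codecs.BOM_UTF16)

# Decoding the same block consumes the leading BOM and returns the text.
decoded = encoded.decode("utf-16")
```

Nothing about the bytes object itself says which encoding it holds; the BOM is the only in-band hint, exactly as it would be for a file.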
>it is absurd to
>expect code dealing with *strings* to handle BOMs.
I agree with that, and it is a good reason why the codecs should always
remove them.
"M.-A. Lemburg" <mal@lemburg.com> wrote:
>I'm still unsure whether I should change the UTF-16 decoder
>to only remove the BOM at the start of the stream -- the above
>case where BOMs are inserted due to string concatenation
>is very common (each .write() to a file will produce such
>a BOM mark).
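The concatenation case can be sketched as follows, assuming a modern Python 3 interpreter rather than the codecs as they stood when this thread was written. Encoding each chunk separately stands in for the per-.write() behaviour described in the quote; the variable names are illustrative only. The second half shows one possible way to avoid the problem at the source, using the standard library's incremental encoder so that the BOM is emitted only once:

```python
import codecs

# Encoding each chunk separately puts a BOM in front of every chunk.
chunks = ["ab", "cd"]
per_write = b"".join(chunk.encode("utf-16") for chunk in chunks)

# The decoder consumes only the leading BOM; the second one survives
# as U+FEFF in the middle of the decoded text.
decoded = per_write.decode("utf-16")
stray_boms = decoded.count("\ufeff")

# An incremental encoder keeps state across calls and emits the BOM once.
encoder = codecs.getincrementalencoder("utf-16")()
stream = encoder.encode("ab") + encoder.encode("cd", final=True)
clean = stream.decode("utf-16")
```

Whether a decoder should also drop the embedded U+FEFF is the open question here: doing so hides the concatenation problem, but it would also remove any U+FEFF that was intended as a ZERO WIDTH NO-BREAK SPACE.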
Toby Dickenson
tdickenson@geminidataloggers.com