[Python-Dev] Unicode byte order mark decoding
Walter Dörwald
walter at livinglogic.de
Tue Apr 5 12:31:06 CEST 2005
M.-A. Lemburg wrote:
>> [...]
>>With the UTF-8-SIG codec, it would apply to all operation modes of
>>the codec, whether stream-based or from strings. Whether or not to
>>use the codec would be the application's choice.
>
> I'd suggest using the same mode of operation as we have in
> the UTF-16 codec: it removes the BOM on the first call
> to the StreamReader .decode() method and writes a BOM
> on the first call to .encode() on a StreamWriter.
>
> Note that the UTF-16 codec is strict with respect to the
> presence of the BOM: you get a UnicodeError if a stream does
> not start with a BOM. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.
I've started writing such a codec. Making the BOM optional on decoding
definitely simplifies the implementation.
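
For illustration, here is a minimal sketch of the behaviour discussed
above: BOM optional on decoding, BOM written once on the first stream
write. It assumes the utf-8-sig codec as shipped in the standard library
from Python 2.5 onwards, which postdates this message, so it should not
be read as the implementation mentioned here. The sample text and the
variable names are made up for the example.

```python
import codecs
import io

BOM = codecs.BOM_UTF8      # b'\xef\xbb\xbf'
text = "D\u00f6rwald"      # arbitrary sample text (assumption)

# Decoding from bytes: the BOM is optional and is dropped when present.
with_bom = (BOM + text.encode("utf-8")).decode("utf-8-sig")
without_bom = text.encode("utf-8").decode("utf-8-sig")

# Stream writing: the BOM is emitted on the first write only.
buf = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buf)
writer.write(text[:3])
writer.write(text[3:])
written = buf.getvalue()

# Stream reading: a leading BOM is removed from the start of the stream.
reader = codecs.getreader("utf-8-sig")(io.BytesIO(written))
read_back = reader.read()

# Incremental decoding: a BOM split across two chunks is still recognised.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
chunks = [BOM[:1], BOM[1:] + b"abc"]
incremental = "".join(decoder.decode(chunk) for chunk in chunks)
incremental += decoder.decode(b"", final=True)
```

Both decoded strings come out without a leading U+FEFF, and the bytes
written by the stream writer contain exactly one BOM at the start.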
Bye,
Walter Dörwald