[Python-Dev] Unicode byte order mark decoding

Walter Dörwald walter at livinglogic.de
Tue Apr 5 12:31:06 CEST 2005


M.-A. Lemburg wrote:

>> [...]
>>With the UTF-8-SIG codec, it would apply to all operation modes of
>>the codec, whether stream-based or from strings. Whether or not to
>>use the codec would be the application's choice.
> 
> I'd suggest using the same mode of operation as we have in
> the UTF-16 codec: it removes the BOM on the first call
> to the StreamReader .decode() method and writes a BOM
> on the first call to .encode() on a StreamWriter.
> 
> Note that the UTF-16 codec is strict with respect to the
> presence of the BOM: you get a UnicodeError if a stream does
> not start with a BOM. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.

I've started writing such a codec. Making the BOM optional on decoding 
definitely simplifies the implementation.
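
To make the intended behaviour concrete, here is a rough sketch of the two directions, not the actual codec. The function names and the `first` flag are made up for illustration; `first` stands in for "this is the first call on the stream". Only `codecs.BOM_UTF8` and the plain "utf-8" codec are existing API.

```python
import codecs


def decode_utf8_sig(data: bytes, first: bool = True) -> str:
    """Decode UTF-8, dropping a leading BOM if there is one.

    Hypothetical helper: the BOM is optional, so input without a BOM
    decodes unchanged instead of raising UnicodeError.
    """
    # Only the very start of the stream may carry the signature.
    if first and data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


def encode_utf8_sig(text: str, first: bool = True) -> bytes:
    """Encode to UTF-8, writing the BOM only on the first call.

    Hypothetical helper mirroring what a StreamWriter would do.
    """
    encoded = text.encode("utf-8")
    return codecs.BOM_UTF8 + encoded if first else encoded
```

With this, decode_utf8_sig(codecs.BOM_UTF8 + b"abc") and decode_utf8_sig(b"abc") both give "abc", which is the relaxed behaviour suggested above. A real stream reader would also have to cope with the three BOM bytes being split across reads, which this sketch ignores.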

Bye,
    Walter Dörwald

