[New-bugs-announce] [issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

Daniel Blanchard report at bugs.python.org
Tue Oct 6 18:39:09 CEST 2015


New submission from Daniel Blanchard:

As I recently discovered when someone filed a PR on chardet (see https://github.com/chardet/chardet/issues/70), BOMs are handled are not handled correctly by the endian-specific encodings UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE, but are by the UTF-16 and UTF-32 encodings.

For example:

>>> 'foo'.encode('utf-16le')
b'f\x00o\x00o\x00'
>>> 'foo'.encode('utf-16')
b'\xff\xfef\x00o\x00o\x00'

You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM correctly prepended to the bytes.

If you were on a little endian system and purposefully wanted to create a UTF-16BE file, the only way to do it is:

>>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
b'\xfe\xff\x00f\x00o\x00o'

This doesn't make a lot of sense to me.  Why is the BOM not prepended automatically when encoding with UTF-16BE?

Furthermore, if you were given a UTF-16BE file on a little endian system, you might think that this would be the correct way to decode it:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
'\ufefffoo'

but as you can see that leaves the BOM on there.  Strangely, decoding with UTF-16 works fine however:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
'foo'

It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be adding/removing the appropriate BOMs, and this is a long-standing bug.

----------
components: Unicode
messages: 252406
nosy: Daniel.Blanchard, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue25325>
_______________________________________


More information about the New-bugs-announce mailing list