[New-bugs-announce] [issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode
Daniel Blanchard
report at bugs.python.org
Tue Oct 6 18:39:09 CEST 2015
New submission from Daniel Blanchard:
As I recently discovered when someone filed a PR on chardet (see https://github.com/chardet/chardet/issues/70), BOMs are not handled correctly by the endian-specific encodings UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE, but are by the UTF-16 and UTF-32 encodings.
For example:
>>> 'foo'.encode('utf-16le')
b'f\x00o\x00o\x00'
>>> 'foo'.encode('utf-16')
b'\xff\xfef\x00o\x00o\x00'
You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM correctly prepended to the bytes.
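The same comparison can be written as a small script instead of a REPL session. This is only a sketch: the variable names are illustrative, and it relies on codecs.BOM_UTF16 being the BOM for the machine's native byte order, which is the byte order the generic 'utf-16' codec uses when encoding.

```python
import codecs

text = "foo"

# The generic codec prepends a BOM in the machine's native byte order.
generic = text.encode("utf-16")
generic_has_bom = generic.startswith(codecs.BOM_UTF16)

# The endian-specific codecs emit only the code units, with no BOM.
le = text.encode("utf-16le")
be = text.encode("utf-16be")
le_has_bom = le.startswith(codecs.BOM_UTF16_LE)
be_has_bom = be.startswith(codecs.BOM_UTF16_BE)

print(generic_has_bom, le_has_bom, be_has_bom)
```

On a big-endian machine the bytes produced by 'utf-16' would differ from the transcript above, but the BOM would still be present.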
If you are on a little-endian system and purposefully want to create a UTF-16BE file, the only way to do it is:
>>> import codecs
>>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
b'\xfe\xff\x00f\x00o\x00o'
This doesn't make a lot of sense to me. Why is the BOM not prepended automatically when encoding with UTF-16BE?
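Until that changes, the manual workaround can be wrapped in a helper. The following is a minimal sketch, not part of the standard library: encode_with_bom and the _BOMS table are hypothetical names, and the helper assumes the caller passes one of the four endian-specific codec names.

```python
import codecs

# Hypothetical lookup table: normalized codec name -> matching BOM constant.
_BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def encode_with_bom(text, encoding):
    """Encode text with an endian-specific codec and prepend its BOM."""
    # codecs.lookup() normalizes aliases such as 'utf-16be' or 'UTF-16BE'.
    name = codecs.lookup(encoding).name
    if name not in _BOMS:
        raise ValueError("not an endian-specific UTF-16/UTF-32 codec: %r" % encoding)
    return _BOMS[name] + text.encode(name)


data = encode_with_bom("foo", "utf-16be")
```

Here data holds the same bytes as the manual concatenation shown above, and the same call works for the UTF-32 variants.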
Furthermore, if you were given a UTF-16BE file on a little endian system, you might think that this would be the correct way to decode it:
>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
'\ufefffoo'
but as you can see, that leaves the BOM in the decoded string. Strangely, decoding with UTF-16 works fine:
>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
'foo'
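For code that has to keep using the endian-specific codecs, one possible workaround is to drop a leading U+FEFF after decoding. This is a sketch under an assumption: decode_strip_bom is a hypothetical helper, and it treats any leading U+FEFF as a BOM, which would be wrong for data that deliberately starts with a ZERO WIDTH NO-BREAK SPACE.

```python
import codecs


def decode_strip_bom(data, encoding):
    """Decode with an endian-specific codec and drop one leading BOM, if any."""
    text = data.decode(encoding)
    # A BOM in the matching byte order decodes to U+FEFF in every UTF codec.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text


with_bom = codecs.BOM_UTF16_BE + "foo".encode("utf-16be")
without_bom = "foo".encode("utf-16be")

stripped = decode_strip_bom(with_bom, "utf-16be")
unchanged = decode_strip_bom(without_bom, "utf-16be")
```

Both stripped and unchanged end up equal to 'foo', so the helper is safe to call whether or not the input carries a BOM.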
It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be adding/removing the appropriate BOMs, and this is a long-standing bug.
----------
components: Unicode
messages: 252406
nosy: Daniel.Blanchard, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue25325>
_______________________________________