[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

Daniel Blanchard report at bugs.python.org
Tue Oct 6 14:21:47 EDT 2015


Daniel Blanchard added the comment:

Thanks for straightening me out there! I had not noticed this in the Unicode FAQ before:

>  Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

Anyway, what brought this up is that chardet detects the encoding of files for people, and we have been returning UTF-16BE or UTF-16LE when we detect a BOM at the front of the file. We recently learned that when people decode with those codecs, things don't work as expected: the BOM is not removed and comes through as a leading U+FEFF. It seems the correct behavior in our case is to just return UTF-16 when a BOM is present.
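
For anyone reading this later, here is a minimal sketch of the behavior in question, using only the standard library. The helper name codec_for_bom is hypothetical and is not chardet's actual code; it only illustrates reporting the generic codec name when a BOM is found.

```python
import codecs

# A little-endian UTF-16 payload with a BOM in front of it.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The endian-specific codec keeps the BOM as a U+FEFF character.
explicit = data.decode("utf-16-le")

# The generic codec uses the BOM to pick the byte order and strips it.
generic = data.decode("utf-16")


def codec_for_bom(raw):
    """Hypothetical helper: return a codec name that consumes the BOM, or None."""
    # UTF-32 must be checked first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None


print(repr(explicit))
print(repr(generic))
print(codec_for_bom(data))
```

With a helper like this, data.decode(codec_for_bom(data)) gives the text without a stray U+FEFF at the front, which is what callers usually expect.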

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue25325>
_______________________________________

