Python 3.0 automatic decoding of UTF16

John Machin sjmachin at lexicon.net
Sat Dec 6 17:34:28 EST 2008


On Dec 7, 9:01 am, David Bolen <db3l.... at gmail.com> wrote:
> Johannes Bauer <dfnsonfsdu... at gmx.de> writes:
> > This is very strange - when using "utf16", endianness should be detected
> > automatically. When I simply truncate the trailing zero byte, I receive:
>
> Any chance that whatever you used to "simply truncate the trailing
> zero byte" also removed the BOM at the start of the file?  Without it,
> utf16 wouldn't be able to detect endianness and would, I believe, fall
> back to native order.

When I read this, I thought "Oh no, surely not!". It seems that you are
correct:
[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'
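
For contrast, here is a minimal sketch in Python 3 syntax (not part of the session above) of the same decode with the BOM left in place. The loop and the names `text` and `decoded` are purely illustrative. With a BOM present, 'utf_16' should pick the right byte order for either endianness:

```python
import codecs

text = 'abcde'
decoded = []

# Prefix each explicitly ordered encoding with its matching BOM, then let
# the generic 'utf_16' codec work out the byte order from that BOM.
for bom, codec in ((codecs.BOM_UTF16_BE, 'utf_16_be'),
                   (codecs.BOM_UTF16_LE, 'utf_16_le')):
    with_bom = bom + text.encode(codec)
    decoded.append(with_bom.decode('utf_16'))
```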

This may well explain one of the Python 3.0 problems that the OP's two
files exhibit: data appears to have been byte-swapped under some
conditions. Possibility: it is reading the file a chunk at a time and
applying the utf_16 codec independently to each chunk -- only the
first chunk will have a BOM, so every later chunk would be decoded in
native byte order.
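
To illustrate that possibility (this is only a sketch of the suspected failure mode, not a reproduction of what Python 3.0's file reading actually does), the following decodes the same BOM-prefixed bytes two ways: chunk by chunk with independent decode calls, and through one incremental decoder that keeps state between chunks. The chunk size `CHUNK` and the variable names are hypothetical, chosen for the example. The data is deliberately built in the non-native byte order so the effect shows up on any machine:

```python
import codecs
import sys

# Build BOM-prefixed UTF-16 data in the byte order opposite to this
# machine's, so the byte swap is visible whatever platform runs this.
if sys.byteorder == 'little':
    data = codecs.BOM_UTF16_BE + 'abcde'.encode('utf_16_be')
else:
    data = codecs.BOM_UTF16_LE + 'abcde'.encode('utf_16_le')

CHUNK = 6  # hypothetical chunk size in bytes; even, so no code unit is split
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# Naive: every chunk is decoded on its own. Only chunks[0] carries the
# BOM, so the second chunk falls back to native byte order.
naive = ''.join(chunk.decode('utf_16') for chunk in chunks)

# Stateful: one incremental decoder sees the BOM once and remembers
# the byte order for all following chunks.
decoder = codecs.getincrementaldecoder('utf_16')()
stateful = ''.join(decoder.decode(chunk) for chunk in chunks)
stateful += decoder.decode(b'', final=True)
```

Here `stateful` comes back as 'abcde', while `naive` has 'ab' followed by the byte-swapped characters u'\u6300\u6400\u6500', the same kind of result as in the 2.5.2 session above.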



