Python 3.0 automatic decoding of UTF16

John Machin sjmachin at lexicon.net
Sat Dec 6 22:30:40 EST 2008


On Dec 7, 9:34 am, John Machin <sjmac... at lexicon.net> wrote:
> On Dec 7, 9:01 am, David Bolen <db3l.... at gmail.com> wrote:
>
> > Johannes Bauer <dfnsonfsdu... at gmx.de> writes:
> > > This is very strange - when using "utf16", endianness should be detected
> > > automatically. When I simply truncate the trailing zero byte, I receive:
>
> > Any chance that whatever you used to "simply truncate the trailing
> > zero byte" also removed the BOM at the start of the file?  Without it,
> > utf16 wouldn't be able to detect endianness and would, I believe, fall
> > back to native order.
>
> When I read this, I thought "O no, surely not!". Seems that you are
> correct:
> [Python 2.5.2, Windows XP]
> | >>> nobom = u'abcde'.encode('utf_16_be')
> | >>> nobom
> | '\x00a\x00b\x00c\x00d\x00e'
> | >>> nobom.decode('utf16')
> | u'\u6100\u6200\u6300\u6400\u6500'
>
> This may well explain one of the Python 3.0 problems that the OP's 2
> files exhibit: data appears to have been byte-swapped under some
> conditions. Possibility: it is reading the file a chunk at a time and
> applying the utf_16 encoding independently to each chunk -- only the
> first chunk will have a BOM.
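The no-BOM fallback quoted above is easy to check directly in Python 3
as well; here's a small sketch using only the standard codecs module
(the explicit-endian codec names never write a BOM):

```python
# Sketch (Python 3): the generic 'utf_16' codec detects endianness
# only from a BOM; with no BOM it falls back to native byte order.
import codecs
import sys

nobom = 'abcde'.encode('utf_16_be')      # big-endian data, no BOM

# On a little-endian machine this decodes to byte-swapped garbage:
if sys.byteorder == 'little':
    assert nobom.decode('utf_16') == '\u6100\u6200\u6300\u6400\u6500'

# Prepending the matching BOM restores correct detection:
assert (codecs.BOM_UTF16_BE + nobom).decode('utf_16') == 'abcde'
```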

Well, no; on further investigation, we're not byte-swapped -- we've
tricked ourselves into decoding on odd-byte boundaries.

Here's the scoop: It's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in 128-
byte chunks. Converting CR LF to \n requires special case handling
when '\r' is detected at the end of the decoded chunk n in case
there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
to the chunk n+1 bytes and decode that -- suddenly with a
2-bytes-per-char encoding like UTF-16 we are 1 byte out of whack.
Better (IMVH[1]O) solution: prepend '\r' to the result of decoding
the chunk n+1 bytes. Each of the OP's files has a \r on a
64-character boundary.
Note: They would exhibit the same symptoms if encoded in utf-16LE
instead of utf-16BE. With the better solution applied, the first file
[the truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible looking output.
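The failure mode can be sketched like this (the chunking and names
here are illustrative, not io.py's actual code):

```python
# Sketch of the bug described above: a '\r' lands exactly at the end
# of a 128-byte chunk of UTF-16-encoded text.
text = 'x' * 63 + '\r\n' + 'hello'
data = text.encode('utf_16_be')          # 2 bytes per char, no BOM

chunk1, chunk2 = data[:128], data[128:]  # 128-byte chunks
assert chunk1.decode('utf_16_be').endswith('\r')   # '\r' is pending

# Buggy approach: prepend the raw byte b'\r' to the next chunk's
# bytes before decoding -- every code unit is now 1 byte out of
# whack, and the odd total length makes the tail look truncated.
garbage = (b'\r' + chunk2).decode('utf_16_be', errors='ignore')
assert garbage != '\r\nhello'

# Better approach: decode the chunk's bytes first, then prepend the
# *character* '\r' to the decoded result.
good = '\r' + chunk2.decode('utf_16_be')
assert good == '\r\nhello'
```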

[1] I thought it best to be Very Humble given what you see when you
do:
   import io
   print(io.__author__)
Hope my surge protector can cope with this :-)
^%!//()
NO CARRIER


