Python 3.0 automatic decoding of UTF16

Terry Reedy tjreedy at udel.edu
Sun Dec 7 04:15:29 EST 2008


John Machin wrote:

> Here's the scoop: It's a bug in the newline handling (in io.py, class
> IncrementalNewlineDecoder, method decode). It reads text files in
> 128-byte chunks. Converting CR LF to \n requires special-case handling
> when '\r' is detected at the end of decoded chunk n, in case there's
> an LF at the start of chunk n+1. Buggy solution: prepend b'\r' to the
> chunk n+1 bytes and decode that -- suddenly, with a 2-bytes-per-char
> encoding like UTF-16, we are 1 byte out of whack. Better (IMVHO[1])
> solution: prepend '\r' to the result of decoding the chunk n+1 bytes.
> Each of the OP's files has \r on a 64-character boundary.
> Note: They would exhibit the same symptoms if encoded in utf-16LE
> instead of utf-16BE. With the better solution applied, the first file
> [the truncated one] gave the expected error, and the second file [the
> apparently OK one] gave sensible-looking output.
> 
> [1] I thought it best to be Very Humble given what you see when you
> do:
>    import io
>    print(io.__author__)
> Hope my surge protector can cope with this :-)
> ^%!//()
> NO CARRIER

Please post this on the tracker so it can get included with other io 
work for 3.0.1.
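
For anyone following along, the difference between the two approaches
John describes can be sketched roughly as below. This is only an
illustration, not the actual io.py code: the function name
decode_with_newlines and its carry_as_bytes flag are made up for the
example, and the chunks are split by hand rather than read in 128-byte
blocks. It uses the standard codecs incremental decoder to stand in for
the real decoding machinery.

```python
import codecs

def decode_with_newlines(chunks, encoding, carry_as_bytes):
    """Decode byte chunks and translate CR LF to LF across chunk boundaries.

    Hypothetical helper for illustration only.
    carry_as_bytes=True  -> carry a trailing '\r' forward as the byte b'\r'
                            (the approach described as buggy).
    carry_as_bytes=False -> carry it forward as the character '\r'
                            (the approach described as better).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending_cr = False
    pieces = []
    for chunk in chunks:
        if pending_cr and carry_as_bytes:
            # One extra byte shifts every 2-byte code unit that follows.
            text = decoder.decode(b"\r" + chunk)
        elif pending_cr:
            # Prepend to the decoded text, so the byte alignment is untouched.
            text = "\r" + decoder.decode(chunk)
        else:
            text = decoder.decode(chunk)
        # Hold back a trailing '\r' until we know whether '\n' follows.
        pending_cr = text.endswith("\r")
        if pending_cr:
            text = text[:-1]
        pieces.append(text.replace("\r\n", "\n"))
    if pending_cr:
        pieces.append("\r")
    return "".join(pieces)

# '\r' is the last character of the first chunk (3 chars = 6 bytes).
data = "ab\r\ncd".encode("utf-16-be")
chunks = [data[:6], data[6:]]

fixed = decode_with_newlines(chunks, "utf-16-be", carry_as_bytes=False)
buggy = decode_with_newlines(chunks, "utf-16-be", carry_as_bytes=True)
print(repr(fixed))
print(repr(buggy))
```

With a one-byte-per-character encoding for these characters (UTF-8,
say) both variants agree, which is presumably why the problem only
shows up with UTF-16 input.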