Python 3.0 automatic decoding of UTF16

Fri Dec 5 13:36:21 EST 2008

Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
>> my (complete) code:
> 
> What OS?  This is often critical when you have a problem interacting
> with the OS.

It's a 64-bit Linux, currently running:

Linux joeserver 2.6.20-skas3-v9-pre9 #4 SMP PREEMPT Wed Dec 3 18:34:49
CET 2008 x86_64 Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux

However, kernel 2.6.26.1 yields the same problem.

>> Entry00Text = "ADAC Verkehrsinfo"\r\n
> 
> From \r\n I guess Windows.  Correct?

Well, not really. The file was created with gammu, a Linux open-source
tool to extract a phonebook off cell phones. However, gammu seems to
generate those Windows CRLF line endings.

> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
> but 'uninterpretable as a utf16 character'.  The traceback below
> confirms that.  It should be an end-of-file marker and should not be
> passed to Python.  I strongly suspect that whatever wrote the file
> screwed up the (OS-specific) end-of-file marker.  I have seen this
> occasionally on Dos/Windows with ascii byte files, with the same symptom
> of reading random garbage past the end of the file.  Or perhaps
> end-of-file does not work right with utf16.

So UTF-16 has an explicit EOF marker within the text? I cannot find one
in the original file, only some kind of starting sequence, I suppose
(0xfeff). The last bytes of the file are 0x00 0x0d 0x00 0x0a, a
simple \r\n line ending.
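
For anyone who wants to repeat that check without a hex editor, here is a
minimal sketch. It assumes the whole file fits in memory; the helper name
inspect_utf16_file and the sample line written to a temporary file are
made up for illustration and are not my real backup file.

```python
import codecs
import os
import tempfile


def inspect_utf16_file(path, tail=4):
    """Return (bom_name, last_bytes, size_is_even) for a file read as raw bytes."""
    with open(path, "rb") as f:
        data = f.read()
    # The byte order mark U+FEFF is stored as FE FF (big endian)
    # or FF FE (little endian) at the very start of the file.
    if data.startswith(codecs.BOM_UTF16_BE):
        bom = "UTF-16-BE"
    elif data.startswith(codecs.BOM_UTF16_LE):
        bom = "UTF-16-LE"
    else:
        bom = None
    # Without surrogate pairs every character takes 2 bytes,
    # so an odd file size would already hint at truncation.
    return bom, data[-tail:], len(data) % 2 == 0


# Hypothetical sample: one line with a CRLF ending, big endian with BOM.
sample = codecs.BOM_UTF16_BE + 'Entry00Text = "ADAC Verkehrsinfo"\r\n'.encode("utf-16-be")
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample)

result = inspect_utf16_file(path)
os.remove(path)
print(result)
```

If the tail is 0x00 0x0d 0x00 0x0a and the size is even, the file itself
ends cleanly, and there is no separate EOF marker to look for.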

>> is actually the only thing the line contains, Python makes the rest up.
> 
> No it does not.  It echoes what the OS gives it with system calls, which
> is random garbage to the end of the disk block.

Could it not be, as Richard suggested, that there's an off-by-one?

> Try open with explicit 'rt' and 'rb' modes and see what happens.  Text
> mode should be default, but then \r should be deleted.

rt:

[...]
['[', 'P', 'h', 'o', 'n', 'e', 'P', 'B', 'K', '0', '0', '3', ']', '\n']
['L', 'o', 'c', 'a', 't', 'i', 'o', 'n', ' ', '=', ' ', '0', '0', '3', '\n']
['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'y', 'p', 'e', ' ', '=', ' ',
'N', 'a', 'm', 'e', '\n']
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
_buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
illegal encoding

rb works, since binary mode does no decoding and takes no encoding parameter.
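
Since rb gives the raw bytes, one way to check whether the data itself is
malformed is to decode it in a single call and look at the position
reported by the exception. This is only a sketch: first_decode_error is a
hypothetical helper, and the sample bytes contain a deliberately inserted
lone surrogate, they are not taken from my file.

```python
import codecs


def first_decode_error(data, encoding="utf-16"):
    """Return (start, end, offending_bytes) of the first decode error, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end are byte offsets into the data that was decoded.
        return exc.start, exc.end, data[exc.start:exc.end]
    return None


# Well-formed sample: BOM plus one CRLF-terminated line.
good = codecs.BOM_UTF16_BE + "Location = 003\r\n".encode("utf-16-be")

# Broken sample (made up): a lone low surrogate DC 00 after "AB".
bad = (codecs.BOM_UTF16_BE
       + "AB".encode("utf-16-be")
       + b"\xdc\x00"
       + "C".encode("utf-16-be"))

good_result = first_decode_error(good)
bad_result = first_decode_error(bad)
print(good_result)
print(bad_result)
```

If the complete byte string decodes without an error while readline() in
text mode still fails, the bytes on disk are presumably fine and the
problem lies somewhere in the chunked reading.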

> Malformed EOF more likely.

Could you please elaborate?

Kind regards,
Johannes

-- 
"My countersuit against you will then be for deliberate mendacity,
slander of God, the Bible and me, and deliberate blasphemy."
         -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
                         <48d8bf1d$0$7510$5402220f at news.sunrise.ch>