Python 3.0 automatic decoding of UTF16
Johannes Bauer
dfnsonfsduifb at gmx.de
Fri Dec 5 13:36:21 EST 2008
Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
>> my (complete) code:
>
> What OS? This is often critical when you have a problem interacting
> with the OS.
It's a 64-bit Linux, currently running:
Linux joeserver 2.6.20-skas3-v9-pre9 #4 SMP PREEMPT Wed Dec 3 18:34:49
CET 2008 x86_64 Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux
Kernel 2.6.26.1, however, yields the same problem.
>> Entry00Text = "ADAC Verkehrsinfo"\r\n
>
> From \r\n I guess Windows. Correct?
Well, not really. The file was created with gammu, a Linux open-source
tool to extract a phonebook off cell phones. However, gammu seems to
generate those Windows-style CRLF line endings.
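As a minimal sketch of how CRLF line endings interact with text mode in a current Python 3 (the sample line is made up, and an in-memory buffer stands in for the real file): with the default newline handling, \r\n is translated to \n on reading, while newline="" leaves it untouched.

```python
import io

# Hypothetical sample line with a Windows-style CRLF ending, encoded as UTF-16.
raw = 'Location = 003\r\n'.encode("utf-16")

# Default newline handling (newline=None): "\r\n" is translated to "\n".
translated = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16").readline()

# newline="": line endings are passed through unchanged.
untranslated = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-16", newline=""
).readline()

print(repr(translated))
print(repr(untranslated))
```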
> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
> but 'uninterpretable as a utf16 character'. The traceback below
> confirms that. It should be an end-of-file marker and should not be
> passed to Python. I strongly suspect that whatever wrote the file
> screwed up the (OS-specific) end-of-file marker. I have seen this
> occasionally on DOS/Windows with ASCII byte files, with the same symptom
> of reading random garbage past the end of the file. Or perhaps
> end-of-file does not work right with utf16.
So UTF-16 has an explicit EOF marker within the text? I cannot find one
in the original file, only some kind of starting sequence, I suppose
(0xfeff). The last bytes of the file are 0x00 0x0d 0x00 0x0a, a
simple \r\n line ending.
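UTF-16 itself defines no end-of-file marker; the only special sequence is the optional byte order mark (U+FEFF) at the start. A minimal sketch to check this, using a made-up sample string rather than the real backup file:

```python
import codecs

# Hypothetical sample text, modeled on the entry quoted earlier.
text = 'Entry00Text = "ADAC Verkehrsinfo"\r\n'

# The "utf-16" codec writes a byte order mark first, then the code units.
data = text.encode("utf-16")

# The first two bytes are the BOM, in whichever byte order the platform uses.
has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The last four bytes are just the encoded "\r\n"; nothing follows them.
tail = data[-4:]

# Decoding consumes the BOM and gives back exactly the original text.
roundtrip = data.decode("utf-16")

print(has_bom, tail, roundtrip == text)
```

The byte order of `tail` depends on the machine that runs the sketch, so it may not match the big-endian bytes 0x00 0x0d 0x00 0x0a seen in the file.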
>> is actually the only thing the line contains, Python makes the rest up.
>
> No it does not. It echoes what the OS gives it with system calls, which
> is random garbage to the end of the disk block.
Could it not be, as Richard suggested, that there's an off-by-one?
> Try open with explicit 'rt' and 'rb' modes and see what happens. Text
> mode should be default, but then \r should be deleted.
rt:
[...]
['[', 'P', 'h', 'o', 'n', 'e', 'P', 'B', 'K', '0', '0', '3', ']', '\n']
['L', 'o', 'c', 'a', 't', 'i', 'o', 'n', ' ', '=', ' ', '0', '0', '3', '\n']
['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'y', 'p', 'e', ' ', '=', ' ', 'N', 'a', 'm', 'e', '\n']
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding
rb works, as it doesn't take an encoding parameter.
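A minimal sketch of the binary-mode route, assuming the whole file fits in memory: read the raw bytes, check that the length is even (every UTF-16 code unit is two bytes), then decode explicitly. The sample data and the temporary file name are made up and only stand in for the real backup file.

```python
import os
import tempfile

# Hypothetical sample data standing in for the gammu backup file.
sample = '[PhonePBK003]\r\nLocation = 003\r\nEntry00Type = Name\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "backup.txt")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-16"))

    # Binary mode returns raw bytes; no decoding happens here.
    with open(path, "rb") as f:
        raw = f.read()

# An odd byte count would point to a truncated or damaged file.
odd_length = len(raw) % 2 == 1

# Decode in one step; the BOM selects the byte order and is dropped.
decoded = raw.decode("utf-16")
lines = decoded.splitlines()

print(odd_length, lines)
```

Decoding the complete byte string in one call sidesteps chunked, incremental decoding, which can help narrow down whether the data or the reader is at fault.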
> Malformed EOF more likely.
Could you please elaborate?
Kind regards,
Johannes
--
"My countersuit against you will then be for deliberate mendacity,
defamation of God, the Bible and me, and deliberate blasphemy."
-- Prophet and visionary Hans Joss aka HJP in de.sci.physik
<48d8bf1d$0$7510$5402220f at news.sunrise.ch>