Python 3.0 automatic decoding of UTF16

Fri Dec 5 14:15:33 EST 2008

On Dec 5, 3:25 pm, Johannes Bauer <dfnsonfsdu... at gmx.de> wrote:
> Hello group,
>
> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
> my (complete) code:
>
> #!/usr/bin/python3.0
>
> class AddressBook():
>         def __init__(self, filename):
>                 f = open(filename, "r", encoding="utf16")
>                 while True:
>                         line = f.readline()
>                         if line == "": break
>                         print([line[x] for x in range(len(line))])
>                 f.close()
>
> a = AddressBook("2008_11_05_Handy_Backup.txt")
>
> This is the file (only 1 kB, if hosting doesn't work please tell me and
> I'll see if I can put it someplace else):
>
> http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.t...
>
> What I get: The file reads fine for the first few lines. Then, in the last
> line, I get lots of garbage (looking like uninitialized memory):
>
> ['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
> '"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
> 'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
> '\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
> '一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
> '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
> '吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
> '\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
> '\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
> '䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']
>
> Where the line
>
> Entry00Text = "ADAC Verkehrsinfo"\r\n
>
> is actually the only thing the line contains; Python makes the rest up.
>
> The actual file is much longer and contains private numbers, so I
> truncated them away. When I let python process the original file, it
> dies with another error:
>
> Traceback (most recent call last):
>   File "./modify.py", line 12, in <module>
>     a = AddressBook("2008_11_05_Handy_Backup.txt")
>   File "./modify.py", line 7, in __init__
>     line = f.readline()
>   File "/usr/local/lib/python3.0/io.py", line 1807, in readline
>     while self._read_chunk():
>   File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
>     self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
>   File "/usr/local/lib/python3.0/io.py", line 1293, in decode
>     output = self.decoder.decode(input, final=final)
>   File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
>   File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
>     return self.decoder(input, self.errors, final)
> UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding
>
> With the place where it dies being exactly the place where it outputs
> the weird garbage in the shortened file. I guess it runs over some page
> boundary here or something?
>
> Kind regards,
> Johannes
>
> --
> "Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit,
> verlästerung von Gott, Bibel und mir und bewusster Blasphemie."
>          -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik
>                          <48d8bf1d$0$7510$54022... at news.sunrise.ch>

Two problems: endianness and a trailing zero byte.
This works for me:

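class AddressBook():
    def __init__(self, filename):
        # Decode explicitly as big-endian UTF-16; newline="\r\n" makes
        # readline() split only on CRLF and return it untranslated.
        with open(filename, "r", encoding="utf_16_be", newline="\r\n") as f:
            while True:
                line = f.readline()
                if len(line) == 0:  # an empty string signals end of file
                    break
                # Strip the CRLF terminator before printing.
                print(line.replace("\r\n", ""))

a = AddressBook("2008_11_05_Handy_Backup2.txt")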
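
To see why the garbage in your output points at a byte-order mismatch, here is a small sketch. It only shows that the characters you got are what ordinary text looks like when its UTF-16 bytes are read in the opposite order; the sample string is made up, and nothing here inspects your actual file:

```python
# Encode a short sample as little-endian UTF-16, then decode the same
# bytes as big-endian, i.e. with every byte pair swapped.
sample = "E\r\n"
swapped = sample.encode("utf_16_le").decode("utf_16_be")

# ascii() escapes every non-ASCII character, so the code points are visible.
print(ascii(swapped))
```

'E' is 0x45, so its swapped form is U+4500, and "\r\n" turns into '\u0d00', '\u0a00', the same pair that shows up in your list.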

Please note the filename: I modified your file by dropping the
trailing zero byte.
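
If editing the file by hand is not an option, the stray byte could also be dropped in code. A minimal sketch, assuming the data really is big-endian UTF-16 with CRLF line endings and at most one leftover byte at the end; the helper name decode_utf16_lines and the second sample line are made up for illustration:

```python
def decode_utf16_lines(raw, encoding="utf_16_be"):
    """Decode UTF-16 bytes into lines, ignoring one stray trailing byte."""
    # UTF-16 code units are two bytes wide, so an odd length means one
    # byte is left over that cannot belong to any character.
    if len(raw) % 2 == 1:
        raw = raw[:-1]
    text = raw.decode(encoding)
    # Split on the CRLF line endings (assumed, as in the code above).
    return text.split("\r\n")


# Hypothetical sample data: two lines plus a single trailing zero byte.
sample = 'Entry00Text = "ADAC Verkehrsinfo"\r\nEntry01Type = NumberMobile'
raw = sample.encode("utf_16_be") + b"\x00"
lines = decode_utf16_lines(raw)
print(lines)
```

For a real file, the bytes would come from opening it in binary mode ("rb") and calling read() on it.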