split() can help to read UTF-16 encoded file without codecs support, why?
Fuzzyman
fuzzyman at gmail.com
Fri Mar 17 04:39:07 EST 2006
Zhongjian Lu wrote:
> Hi Guys,
>
> I was processing a UTF-16 encoded file with a BOM and was not aware of the
> codecs package at first. I wrote the following code:
> ===== Code 1============================
> for i in open("d:\python24\lzjtest.xml", 'r').readlines():
>     i = i.decode("utf-16")
>     print i
> =======================================
> Output was:
> Traceback (most recent call last):
>   File "D:\Python24\testutf-16.py", line 4, in -toplevel-
>     i = i.decode("utf-16")
>   File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
>     return codecs.utf_16_decode(input, errors, True)
> UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 84: truncated data
>
UTF-16 is a 'two-byte encoding'. This means that '\r\n' is represented
(in little-endian byte order) as:
'\r\x00\n\x00'
When you use readlines on a file opened without a codec, it splits on
bytes rather than characters, so each line ends right after the '\n'
byte. This probably returns something like:
'...\r\x00\n', '\x00...'
You can see how the first line ends with a single leftover byte: the
'\n' (0x0a) has lost its '\x00' partner, which now starts the next
line. That odd byte is the 'truncated data' in your traceback.
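
To see the effect of splitting on bytes, here is a small sketch in
present-day Python 3 (the code in the question is Python 2.4). The
sample text and the names text, data and chunks are made up for
illustration, little-endian byte order is assumed, and io.BytesIO
stands in for a file opened without a codec:

```python
import codecs
import io

# Hypothetical two-line sample, encoded as UTF-16 little-endian with a BOM.
text = "ab\r\ncd\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A byte-oriented readlines() ends each line right after a 0x0a byte,
# which cuts the two-byte '\n' character in half.
chunks = io.BytesIO(data).readlines()

# The first chunk has an odd number of bytes, so decoding it fails.
first_chunk_failed = False
try:
    chunks[0].decode("utf-16")
except UnicodeDecodeError:
    first_chunk_failed = True

# Decoding the complete byte string works and drops the BOM.
whole = data.decode("utf-16")
```

Here chunks[0] ends with a lone 0x0a byte and the next chunk starts
with the 0x00 that belonged to it, so neither piece is valid UTF-16 on
its own.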
> I searched Google and found an article on a similar problem that said to use
> split(). I did not quite grasp the article's meaning and recoded it as:
> ==== Code 2==============================
> for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
>     i = i.decode("utf-16")
>     print i
> =======================================
> Then it worked (it echoed the file).
>
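You will probably find that the two-byte sequence '\r\n' never occurs
in the byte-string (the '\r' and '\n' bytes are separated by '\x00'),
so split() returns the whole file as a single item. The decode then
runs on the complete data in one go, which is why it succeeds.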
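
A quick way to check this, sketched in present-day Python 3 with a
made-up sample (the names text, data and pieces are only for
illustration, and little-endian byte order with plain ASCII content is
assumed):

```python
import codecs

# Hypothetical two-line sample, encoded as UTF-16 little-endian with a BOM.
text = "ab\r\ncd\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The bytes for '\r\n' are 0d 00 0a 00, so the adjacent pair 0d 0a
# never appears and split() has nothing to split on.
pieces = data.split(b"\r\n")

# The single piece is the whole file, so it decodes cleanly.
decoded = [p.decode("utf-16") for p in pieces]
```

With this sample pieces holds exactly one item, so the loop in Code 2
runs once over the entire file rather than once per line.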
HTH
All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
> Later I got to know codecs and wrote the following code:
>
> ==== Code 3 =============================
> import codecs
> for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
>     print i
> =======================================
> It worked and echoed the file.
>
> I am wondering what the problem is with the first code and why the bug
> is fixed in the second.
>
> Thanks in advance.
>
> -Zhongjian