UTF-16-LE and split() under MS-Windows XP

Colin S. Miller colinsm.spam-me-not at picsel.com
Wed Jul 9 05:20:49 EDT 2003


Hi,

I'm trying to parse a UTF-16-LE encoded file that contains colon-
delimited records. The data files were generated by XP's Notepad editor 
and contain a BOM. I'm using
Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32


My code works if I change the encoding to UTF-16-BE; for little 
endian, however, it dies with

C:\>unicode_test.py
Traceback (most recent call last):
   File "C:\unicode_test.py", line 36, in ?
     parse_file("c:\unicode.txt")
   File "C:\unicode_test.py", line 27, in parse_file
     line = file.readline()
   File "c:\Python22\lib\codecs.py", line 330, in readline
     return self.reader.readline(size)
   File "c:\Python22\lib\codecs.py", line 252, in readline
     return self.decode(line, self.errors)[0]
UnicodeError: UTF-16 decoding error: truncated data

C:\>


The source code and Unicode data files are attached.
(I know attaching is frowned upon in most groups, but
they are small and I don't want them getting mangled.)

Where have I gone wrong, and what is the correct way
to verify the BOM?

This snippet
    bom = file.read(2)
    if (bom != "\xff\xfe"):
       print "Data file is not in UTF-16-LE"
       return

fails with
"UnicodeError: ASCII encoding error: ordinal not in range(128)"
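
The ASCII error suggests that read() returned already-decoded Unicode
text here, so comparing it with the byte string "\xff\xfe" forced an
implicit ASCII conversion. A minimal sketch of one way around that,
assuming the file can be opened separately in binary mode and a Python
version newer than the 2.2.2 above (the function names are hypothetical,
chosen only for illustration):

```python
import codecs

def starts_with_utf16_le_bom(data):
    """Return True if the raw bytes begin with the UTF-16-LE BOM (FF FE)."""
    return data[:2] == codecs.BOM_UTF16_LE

def file_has_utf16_le_bom(path):
    # Binary mode: read() returns undecoded bytes, so the comparison
    # never goes through an implicit ASCII conversion.
    with open(path, "rb") as f:
        return starts_with_utf16_le_bom(f.read(2))
```

Comparing raw bytes against codecs.BOM_UTF16_LE keeps the check
independent of whatever decoding is applied to the rest of the file.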


According to
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=3ee7885a_6%40corp.newsgroups.com&rnum=3&prev=/groups%3Fq%3DUnicodeError:%2BUTF-16%2Bdecoding%2Berror:%2Btruncated%2Bdata%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26selm%3D3ee7885a_6%2540corp.newsgroups.com%26rnum%3D3
(comp.lang.python 11 Jun 2003 'UTF-16 encoding line breaks?')
readline() isn't supported on UTF-16, but the thread doesn't offer any
alternative suggestions.
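
The traceback is consistent with readline() splitting the raw data at
the 0x0A byte: in little-endian order a newline is stored as 0A 00, so
the trailing 00 is cut off and the line no longer holds a whole number
of 16-bit units. In big-endian order (00 0A) the split happens to land
on a unit boundary, which would explain why UTF-16-BE works. One
possible alternative, sketched here under the assumption that the file
is small enough to read in one go and that a newer Python than 2.2.2 is
available (parse_records is a hypothetical name), is to decode the
whole contents first and only then split:

```python
import codecs

def parse_records(raw):
    """Decode UTF-16-LE bytes and split them into colon-delimited records."""
    # Drop the BOM if present so it does not end up in the first field.
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    # Decode everything at once; no byte-level line splitting is involved.
    text = raw.decode("utf-16-le")
    # splitlines() handles both "\n" and Notepad's "\r\n" line endings.
    return [line.split(":") for line in text.splitlines() if line]
```

The raw bytes would come from something like open(path, "rb").read().
Splitting after decoding means the line breaks are found in text rather
than in the byte stream, so the byte order no longer matters.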

TIA,
Colin S. Miller

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unicode_test.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20030709/5495ec2d/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unicode.dat
URL: <http://mail.python.org/pipermail/python-list/attachments/20030709/5495ec2d/attachment-0001.ksh>

