UTF-16-LE and split() under MS-Windows XP
Colin S. Miller
colinsm.spam-me-not at picsel.com
Wed Jul 9 05:20:49 EDT 2003
Hi,
I'm trying to parse a UTF-16-LE encoded file that contains
colon-delimited records. The data file was generated by XP's Notepad
editor and contains a BOM. I'm using
Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32
My code works if I change the encoding to UTF-16-BE; for little
endian, however, it dies with
C:\>unicode_test.py
Traceback (most recent call last):
  File "C:\unicode_test.py", line 36, in ?
    parse_file("c:\unicode.txt")
  File "C:\unicode_test.py", line 27, in parse_file
    line = file.readline()
  File "c:\Python22\lib\codecs.py", line 330, in readline
    return self.reader.readline(size)
  File "c:\Python22\lib\codecs.py", line 252, in readline
    return self.decode(line, self.errors)[0]
UnicodeError: UTF-16 decoding error: truncated data
C:\>
The source code and unicode data files are attached.
(I know attaching is frowned on in most groups, but
they are small and I don't want them getting mangled.)
Where have I gone wrong, and what is the correct method
to verify the BOM mark?
This snippet
    bom = file.read(2)
    if (bom != "\xff\xfe"):
        print "Data file is not in UTF-16-LE"
        return
fails with
"UnicodeError: ASCII encoding error: ordinal not in range(128)"
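A likely cause of that error, assuming `file` was opened through codecs.open() with a UTF-16 codec: read() then returns a unicode object, and comparing it with the byte string "\xff\xfe" forces an implicit ASCII conversion of the byte string. A minimal sketch of a BOM check that avoids this by reading raw bytes is shown here. It is written for a current Python 3 rather than the 2.2.2 above, and the helper name has_utf16_le_bom is hypothetical.

```python
import codecs

def has_utf16_le_bom(path):
    """Return True if the file at `path` starts with the UTF-16-LE BOM.

    Hypothetical helper for illustration, not part of the original script.
    """
    # Binary mode: read() returns raw bytes, so no implicit decoding happens.
    with open(path, "rb") as f:
        return f.read(2) == codecs.BOM_UTF16_LE  # the bytes FF FE
```

The comparison is bytes against bytes, so no codec is involved until the caller decides to decode the rest of the file.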
According to
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=3ee7885a_6%40corp.newsgroups.com&rnum=3&prev=/groups%3Fq%3DUnicodeError:%2BUTF-16%2Bdecoding%2Berror:%2Btruncated%2Bdata%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26selm%3D3ee7885a_6%2540corp.newsgroups.com%26rnum%3D3
(comp.lang.python 11 Jun 2003 'UTF-16 encoding line breaks?'),
readline() isn't supported on UTF-16, but the thread doesn't give any
alternative suggestions.
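One possible alternative to readline(), sketched under the assumption that the data file is small enough to read in one go: read the raw bytes, check and strip the BOM, decode the whole buffer as UTF-16-LE, and only then split into lines and fields. A plausible reason readline() fails for little endian is that the line is cut at the 0x0A byte, leaving the 0x00 that completes the newline character behind, so the decoder sees an odd number of bytes. The function name parse_records and its delimiter parameter are hypothetical, and the code targets a current Python 3 rather than 2.2.2.

```python
import codecs

def parse_records(path, delimiter=":"):
    """Return a list of field lists from a colon-delimited UTF-16-LE file.

    Hypothetical helper for illustration, not the original parse_file().
    """
    with open(path, "rb") as f:
        raw = f.read()  # raw bytes, nothing decoded yet

    if not raw.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("Data file is not in UTF-16-LE")

    # Drop the two BOM bytes, then decode the rest in one step so that
    # no line is ever cut in the middle of a two-byte code unit.
    text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")

    # splitlines() handles the \r\n line endings that Notepad writes.
    return [line.split(delimiter) for line in text.splitlines() if line]
```

Splitting after decoding means the line and field boundaries are found in text, not in bytes, which sidesteps the byte-order problem entirely.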
TIA,
Colin S. Miller
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unicode_test.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20030709/5495ec2d/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unicode.dat
URL: <http://mail.python.org/pipermail/python-list/attachments/20030709/5495ec2d/attachment-0001.ksh>