[ python-Bugs-904474 ] File read of Chinese utf-16-le treats upper byte 1A as EOF

Wed Feb 25 17:53:46 EST 2004

Bugs item #904474, was opened at 2004-02-25 20:30
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=904474&group_id=5470

>Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Ron Rother (rrother)
Assigned to: Nobody/Anonymous (nobody)
Summary: File read of Chinese utf-16-le treats upper byte 1A as EOF

Initial Comment:
Any utf-16-le Chinese character with 1A as its first (least 
significant) byte causes the remainder of the file to be ignored.

code extract:

import codecs
import sys

(utf16_encoder, utf16_decoder, utf16_reader,
 utf16_writer) = codecs.lookup("utf-16-le")

ifile = utf16_reader(open(sys.argv[1], "r"))

t = ifile.read()

When the Chinese character 1A 5C (尚) is encountered, 
everything from the 5C is discarded.

These 3 lines:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="尚未选择任何书卷！"

are input as:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="
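
To make the byte layout concrete, here is a minimal sketch that encodes the
character with code point 23578 (U+5C1A) and inspects its UTF-16-LE bytes. It
assumes Python 3 syntax, which postdates this report, and the variable names
are illustrative only.

```python
# Assumption: Python 3 syntax; the original report predates it.
ch = "\u5c1a"  # code point 23578 decimal == 0x5C1A

encoded = ch.encode("utf-16-le")

# Little-endian order stores the low byte (0x1A) first, then the high byte (0x5C).
low_byte, high_byte = encoded[0], encoded[1]

# 0x1A is the ASCII SUB control character (Ctrl-Z). Text-mode reads on some
# platforms, notably the Windows C runtime, treat it as an end-of-file marker.
print([hex(b) for b in encoded])
```

The first byte on disk is therefore 1A, which is the byte a text-mode read
would see before the 5C.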

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-02-25 23:53

Message:
I believe there is a misconception here: the open(..., "r")
will cause the file to be opened in C lib's text mode. Since
UTF-16 is binary data, this will lead to problems with line
breaking and file handling in general.

You should try:

import codecs
ifile = codecs.open(filename, 'rb', encoding='utf-16-le')
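
A small round-trip sketch of the suggestion above, under a few assumptions:
Python 3 syntax, a temporary file in place of the reporter's input file, and a
shortened sample text. It writes UTF-16-LE bytes that contain 0x1A and reads
them back through codecs.open in binary mode.

```python
import codecs
import os
import tempfile

# Hypothetical sample text; the second line contains U+5C1A, whose
# UTF-16-LE encoding starts with the byte 0x1A.
text = ('English="You have not selected any books!"\r\n'
        'Chinese(Simplified)="\u5c1a\u672a"\r\n')

# Write the raw UTF-16-LE bytes in binary mode so nothing is translated.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("utf-16-le"))

# Read back with binary mode plus an explicit encoding, as suggested.
with codecs.open(path, "rb", encoding="utf-16-le") as ifile:
    t = ifile.read()

os.remove(path)

# The whole file should survive, including everything after the 0x1A byte.
print(t == text)
```

Because the underlying file is opened in binary mode, neither the 0x1A byte
nor the line endings are interpreted before the decoder sees them.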

----------------------------------------------------------------------
