[ python-Bugs-904474 ] File read of Chinese utf-16-le treats upper byte 1A as EOF
SourceForge.net
noreply at sourceforge.net
Wed Feb 25 17:53:46 EST 2004
Bugs item #904474, was opened at 2004-02-25 20:30
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=904474&group_id=5470
>Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Ron Rother (rrother)
Assigned to: Nobody/Anonymous (nobody)
Summary: File read of Chinese utf-16-le treats upper byte 1A as EOF
Initial Comment:
Any utf-16-le Chinese character with 1A as the most
significant byte causes remainder of file to be ignored.
code extract:
import codecs
import sys

(utf16_encoder, utf16_decoder, utf16_reader,
 utf16_writer) = codecs.lookup("utf-16-le")
ifile = utf16_reader(open(sys.argv[1], "r"))
t = ifile.read()
When the Chinese character 1A 5C (尚) is encountered,
everything from the 5C is discarded.
These 3 lines:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="尚未选择任何书卷!"
are input as:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="
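The byte layout behind this symptom can be checked directly. The following is only a sketch, assuming Python 3: it simulates the truncation by cutting the encoded data at the first 0x1A byte and does not exercise any platform's text-mode behavior itself. The helper name truncate_at_1a is hypothetical. On platforms whose C library treats 0x1A (Ctrl-Z) as end-of-file in text mode, such as Windows, a text-mode read would be expected to stop at the same place.

```python
# Hypothetical illustration (Python 3); truncate_at_1a is not a library function.
line = 'Chinese(Simplified)="尚未选择任何书卷!"'
data = line.encode("utf-16-le")

# U+5C1A is stored low byte first in UTF-16-LE, so 0x1A precedes 0x5C.
first_char_bytes = "尚".encode("utf-16-le")


def truncate_at_1a(raw):
    """Simulate a reader that stops at the first 0x1A byte."""
    cut = raw.find(b"\x1a")
    return raw if cut == -1 else raw[:cut]


# For this sample the cut lands on a character boundary, so decoding succeeds.
visible = truncate_at_1a(data).decode("utf-16-le")

print(first_char_bytes.hex())  # 1a5c
print(visible)                 # Chinese(Simplified)="
```

The truncated result matches the third input line shown above, which ends right after the opening quote.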
----------------------------------------------------------------------
>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-02-25 23:53
Message:
Logged In: YES
user_id=38388
I believe there is a misconception here: open(..., "r")
opens the file in the C library's text mode. Since UTF-16
is binary data, this leads to problems with line breaking
and file handling in general.
You should try:
import codecs
ifile = codecs.open(filename, 'rb', encoding='utf-16-le')
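To illustrate why binary mode matters, here is a minimal round-trip sketch, assuming Python 3. It writes a small UTF-16-LE sample to a temporary file (the file and its contents are made up for the demonstration), reads the raw bytes back in binary mode, and only then decodes them. It uses a plain binary open() plus codecs.decode rather than codecs.open, purely to keep the example independent of the Python version.

```python
import codecs
import os
import tempfile

text = 'Chinese(Simplified)="尚未选择任何书卷!"\n'

# Create a sample UTF-16-LE file (hypothetical temp file for this demo).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("utf-16-le"))

try:
    # Binary mode hands back every byte unchanged, including 0x1A,
    # so the decoder sees the complete data.
    with open(path, "rb") as f:
        decoded = codecs.decode(f.read(), "utf-16-le")
finally:
    os.remove(path)

print(decoded == text)  # True
```

The key point is that the bytes are never passed through text-mode handling before the UTF-16 decoder sees them.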
----------------------------------------------------------------------