[I18n-sig] JapaneseCodecs 1.1 with ISO-2022-JP codec

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Sat, 25 Nov 2000 07:20:57 +0900


* Martin v. Loewis
|
| > ReadStream.readline() reads lines by calling readline() of
| > the underlying stream object, and converts them into Unicode
| > objects.  Therefore, line breaking is done in the layer of
| > native encodings.  I believe it works well for at least the
| > three Japanese encodings.
| 
| That is the case that may make trouble. Consider u"Hello\nWorld" in
| UTF-16LE; it is
| 
|   H \0 e \0 l \0 l \0 o \0 \n \0 W \0 r \0 l \0 d \0
| 
| Now, if you do readline on the underlying stream, you get
| 
|   H \0 e \0 l \0 l \0 o \0 \n
| 
| Passing that to the UTF-16 decoder causes an exception: this is an
| uneven number of bytes, which is illegal in UTF-16 (it should have
| read \n\0 instead).
| 
| I was merely asking for confirmation that this is not a problem in
| your encodings (i.e. the byte \012 always means newline, no matter
| where it appears in the encoding).

I see.  In the three Japanese encodings EUC-JP, Shift_JIS and
ISO-2022-JP, the *single* byte \012 always means newline.  So
most implementations of Python's file object interface, any of
which may be the underlying stream of a StreamReader, should be
able to handle newlines without trouble.
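
To illustrate the point, here is a minimal sketch.  It assumes a
modern Python 3 and its bundled euc_jp, shift_jis and iso2022_jp
codecs rather than the JapaneseCodecs package itself, and the
helper name decode_line_by_line is only illustrative.  It splits
the encoded bytes with readlines() first and decodes each line
afterwards, which is the order of operations described above.

```python
import io

def decode_line_by_line(text, encoding):
    """Encode text, split the bytes into lines, then decode each line."""
    stream = io.BytesIO(text.encode(encoding))
    return [line.decode(encoding) for line in stream.readlines()]

sample = "日本語\nテキスト"

# Byte-level line splitting is safe for the three Japanese encodings,
# because the byte 0x0A never occurs inside a multi-byte character.
for enc in ("euc_jp", "shift_jis", "iso2022_jp"):
    assert "".join(decode_line_by_line(sample, enc)) == sample

# For UTF-16LE the first "line" ends in the middle of the two-byte
# newline, so decoding it on its own fails.
try:
    decode_line_by_line("Hello\nWorld", "utf-16-le")
    utf16_failed = False
except UnicodeDecodeError:
    utf16_failed = True
```

With UTF-16LE the first chunk has an odd number of bytes, so the
decoder rejects it, as in the quoted example.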

One question: is there a requirement that a codec must be able
to deal with encodings (e.g. UTF-16LE) other than the intended
encoding (e.g. EUC-JP)?  As of this writing, each Japanese codec
can handle only its one intended encoding, and will raise
an exception when a byte within an unexpected range appears in
the input stream.
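
As a small sketch of that behaviour, assuming the euc_jp codec
bundled with a modern Python 3 (not the JapaneseCodecs package
discussed here) and a hypothetical helper named try_decode:

```python
def try_decode(data, encoding="euc_jp"):
    """Return the decoded text, or None if the bytes are not valid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

valid = "日本語".encode("euc_jp")
invalid = b"\xff\xff"  # 0xFF is outside the EUC-JP lead-byte ranges

decoded_valid = try_decode(valid)      # decodes normally
decoded_invalid = try_decode(invalid)  # the codec raises, so None
```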

BTW, there was a patch in the SourceForge Patch Manager that
fixes exactly the problem described in the quotation above:

http://sourceforge.net/patch/?func=detailpatch&patch_id=101477&group_id=5470

I was the sender of this patch (I did not have a SourceForge
account when I posted it).

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>