[I18n-sig] .readline() in codecs (JapaneseCodecs 1.1 with ISO-2022-JP codec)

M.-A. Lemburg mal@lemburg.com
Fri, 24 Nov 2000 22:42:10 +0100


"Martin v. Loewis" wrote:
> 
> > ReadStream.readline() reads lines by calling readline() of
> > the underlying stream object, and converts them into Unicode
> > objects.  Therefore, line breaking is done in the layer of
> > native encodings.  I believe it works well for at least the
> > three Japanese encodings.
> 
> That is the case that may make trouble. Consider u"Hello\nWorld" in
> UTF-16LE; it is
> 
>   H \0 e \0 l \0 l \0 o \0 \n \0 W \0 r \0 l \0 d \0
> 
> Now, if you do readline on the underlying stream, you get
> 
>   H \0 e \0 l \0 l \0 o \0 \n
> 
> Passing that to the UTF-16 decoder causes an exception: this is an
> uneven number of bytes, which is illegal in UTF-16 (it should have
> read \n\0 instead).

Background for codec writers (I never thought there would be so
many of you :-):

The current codecs are a bit naive when it comes to supporting
full Unicode line breaking (Unicode has many more line break characters
than just CR and/or LF): they let the underlying stream do the
line breaking, so lines are split at the byte level before any
decoding takes place.
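
Martin's UTF-16LE example quoted above can be reproduced in a few
lines. This is only a sketch in present-day Python 3 (which postdates
this message); the variable names are made up for illustration, and
io.BytesIO merely stands in for the underlying byte stream.

```python
import io

# Martin's example: u"Hello\nWorld" encoded as UTF-16LE.
data = "Hello\nWorld".encode("utf-16-le")

# Byte-level readline() stops right after the 0x0A byte and leaves the
# second byte of the line feed (0x00) behind in the stream.
first = io.BytesIO(data).readline()

try:
    first.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True  # odd number of bytes: truncated data

# A Unicode-only line break (U+2028) is invisible at the byte level ...
raw_line = io.BytesIO("one\u2028two\n".encode("utf-8")).readline()
# ... but is honoured once the text has been decoded.
text_lines = raw_line.decode("utf-8").splitlines()
```

The first half shows the failure mode for encodings in which the
byte 0x0A can be part of a wider code unit; the second half shows
the opposite problem, a line break that the byte stream never sees.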

When I wrote the base classes for the codecs, I decided not to
add full buffering support (which is needed in order to do .readline()
in the codec) because of the many problems it causes for stream
handling. Wrapped streams in particular would cause serious trouble,
since the read position of the underlying stream may not be where
the programmer expects it to be...

As a codec writer, you may choose to support full Unicode
line breaking. To do this, you'd have to add buffer support
to .read(), .readline() and .readlines(), in much the same
way as is done in StringIO.
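
A minimal sketch of what such buffering could look like, assuming
present-day Python 3. BufferedLineReader and its chunk_size parameter
are hypothetical names, not part of the codecs module, and only
.readline() is shown. str.splitlines() is used as an approximation
of Unicode line breaking.

```python
import codecs
import io


class BufferedLineReader:
    """Hypothetical helper: decode first, then break lines on the text."""

    def __init__(self, stream, encoding, chunk_size=64):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk_size = chunk_size
        self.buffer = ""   # decoded text that has not been returned yet
        self.eof = False

    def readline(self):
        while True:
            pieces = self.buffer.splitlines(True)  # keep line endings
            # More than one piece means the first one is a complete
            # line; at EOF whatever is left is the last line.
            if len(pieces) > 1 or (pieces and self.eof):
                line = pieces[0]
                self.buffer = self.buffer[len(line):]
                return line
            # A single piece is complete if it ends in a line break,
            # unless that break is a bare CR that may still become CR LF.
            if pieces:
                line = pieces[0]
                if line != line.splitlines()[0] and not line.endswith("\r"):
                    self.buffer = ""
                    return line
            if self.eof:
                return ""
            # Read ahead; the incremental decoder keeps incomplete
            # byte sequences until the rest of them arrives.
            chunk = self.stream.read(self.chunk_size)
            self.eof = not chunk
            self.buffer += self.decoder.decode(chunk, final=self.eof)


data = "Hello\nWorld\u2028End".encode("utf-16-le")
reader = BufferedLineReader(io.BytesIO(data), "utf-16-le", chunk_size=3)
lines = [reader.readline() for _ in range(4)]
```

Note that the reader has to read ahead of the line it returns, so
the position of the wrapped stream no longer matches what the caller
has consumed. That is exactly the wrapped-stream problem described
above.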
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/