[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Tue Jul 27 22:39:45 CEST 2004


Python's Unicode machinery currently has problems when decoding
incomplete input.

When codecs.StreamReader.read() encounters a decoding error, it
reads more bytes from the input stream and retries decoding.
This is broken for two reasons:
1) The error might be due to a malformed byte sequence in the input,
    a problem that can't be fixed by reading more bytes.
2) There may be no more bytes available at this time. Once more
    data is available, decoding can't continue because the bytes
    already read from the input stream have been thrown away.
(sio.DecodingInputFilter has the same problems.)
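
A minimal sketch of the two cases, for illustration only. It assumes
the incremental decoder API (codecs.getincrementaldecoder) from the
current Python 3 standard library, which is not part of the patch
described here; the variable names are made up for the example.

```python
import codecs

euro = "\u20ac".encode("utf-8")  # three bytes: b"\xe2\x82\xac"

# Incomplete input: a stateful decoder keeps the partial sequence
# instead of discarding it, so decoding can resume later.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(euro[:2])   # nothing can be decoded yet
second = decoder.decode(euro[2:])  # the last byte completes the character

# Malformed input: reading more bytes cannot repair this,
# so the error has to be reported right away.
try:
    codecs.getincrementaldecoder("utf-8")().decode(b"\xff")
    malformed_detected = False
except UnicodeDecodeError:
    malformed_detected = True
```

The point of the sketch is that "incomplete" and "malformed" need
different handling: the first is buffered, the second is an error.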

I've uploaded a patch that fixes these problems to SF:
http://www.python.org/sf/998993

The patch implements a few additional features:
- read() has an additional argument chars that can be used to
   specify the number of characters that should be returned.
- readline() is supported on all readers derived from
   codecs.StreamReader.
- readline() and readlines() have an additional option
   for dropping the u"\n".
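
As a rough illustration of these features, the sketch below uses the
codecs.StreamReader signatures found in the current Python 3 standard
library: read(chars=...), readline(keepends=...) and
readlines(keepends=...). That the parameter names match the patch
exactly is an assumption; in particular, keepends is the present-day
name of the line-ending option and is not taken from this post.

```python
import codecs
import io

# An in-memory byte stream standing in for a file or socket.
raw = io.BytesIO("äöü\nsecond line\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# Ask for a number of characters rather than a number of bytes.
head = reader.read(chars=2)

# Read the rest of the first line and drop the line ending.
rest_of_line = reader.readline(keepends=False)

# Read all remaining lines, again without line endings.
remaining = reader.readlines(keepends=False)
```

Here head holds two characters even though they occupy four bytes
in UTF-8.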

The patch is still missing changes to the escape codecs
("unicode_escape" and "raw_unicode_escape") and I haven't
touched the CJK codecs, but it has test cases that check
the new functionality for all affected codecs
(UTF-7, UTF-8, UTF-16, UTF-16-LE, UTF-16-BE).
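
A check of this kind can be sketched as follows. This is not the test
code from the patch: it is a hypothetical round-trip test written
against the incremental decoder API of current Python 3, feeding each
encoded string one byte at a time. The helper name decode_bytewise
and the sample text are assumptions made for the example.

```python
import codecs

CODECS = ("utf-7", "utf-8", "utf-16", "utf-16-le", "utf-16-be")
SAMPLE = "a\u00e4\u20acb"  # ASCII, Latin-1 and a three-byte UTF-8 character


def decode_bytewise(data, encoding):
    """Decode data by feeding a stateful decoder one byte at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(bytes([byte])) for byte in data]
    # final=True tells the decoder that no more input will follow.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


results = {
    name: decode_bytewise(SAMPLE.encode(name), name) for name in CODECS
}
```

Every entry in results should equal SAMPLE if the decoders keep
incomplete byte sequences between calls.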

Could someone take a look at the patch?

Bye,
    Walter Dörwald



