[Patches] [ python-Patches-998993 ] Decoding incomplete unicode

Tue Jul 27 22:35:29 CEST 2004

Patches item #998993, was opened at 2004-07-27 22:35
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=998993&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Nobody/Anonymous (nobody)
Summary: Decoding incomplete unicode

Initial Comment:
Python's Unicode machinery currently has problems 
decoding incomplete input.

When codecs.StreamReader.read() encounters a 
decoding error it reads more bytes from the input stream 
and retries decoding. This is broken for two reasons:
1) The error might be due to a malformed byte sequence 
in the input, a problem that can't be fixed by reading 
more bytes.
2) There may be no more bytes available at this time. 
Once more data is available decoding can't continue 
because bytes from the input stream have already been 
read and thrown away.

(sio.DecodingInputFilter has the same problems)
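
The difference between the two failure cases can be sketched
in current Python 3 (an assumption: the incremental decoder
API used here postdates this report, and classify() is a
hypothetical helper, not part of the patch). Truncated input
becomes decodable once more bytes arrive; malformed input
never does:

```python
import codecs


def classify(data):
    """Return 'ok', 'truncated' or 'malformed' for UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError:
        pass
    # With final=False an incremental decoder accepts a byte sequence
    # that is merely cut off at the end, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    try:
        decoder.decode(data, final=False)
        return "truncated"
    except UnicodeDecodeError:
        return "malformed"


complete = "a\u00e4".encode("utf-8")   # b"a\xc3\xa4"
results = [
    classify(complete),        # whole input decodes
    classify(complete[:-1]),   # last character cut off: more bytes help
    classify(b"a\xffb"),       # 0xff is never valid: more bytes do not help
]
```

Only the "truncated" case justifies waiting for more input.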

To fix this, three changes are required:
a) We need stateful versions of the decoding functions 
that don't raise "truncated data" exceptions, but decode 
as much as possible and return the position where 
decoding stopped.
b) The StreamReader classes need to use those stateful 
versions of the decoding functions.
c) codecs.StreamReader needs to keep an internal 
buffer with the bytes read from the input stream that 
haven't been decoded into unicode yet.

For a) the Python API already exists: All decoding 
functions in the codecs module return a tuple containing 
the decoded unicode object and the number of bytes 
consumed. But this functionality isn't implemented in the 
decoders:

codecs.utf_8_decode(u"aä".encode("utf-8")[:-1])
raises an exception instead of returning (u"a", 1).

This can be fixed by extending the UTF-8 and UTF-16 
decoding functions like this:

PyObject *PyUnicode_DecodeUTF8Stateful(
   const char *s, int size,
   const char *errors, int *consumed)

If consumed == NULL PyUnicode_DecodeUTF8Stateful() 
behaves like PyUnicode_DecodeUTF8() (i.e. it raises 
a "truncated data" exception). If consumed != NULL it 
decodes as much as possible (raising exceptions for 
invalid byte sequences) and puts the number of bytes 
consumed into *consumed.
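
At the Python level the same two modes can be illustrated
with codecs.utf_8_decode() as found in current CPython 3,
where a third "final" argument plays the role of consumed
(an assumption: this argument is not described in this
report, and the mapping to consumed == NULL / != NULL is
only an analogy). A minimal sketch:

```python
import codecs

# b"a\xc3": the two-byte sequence for the second character is cut off.
data = "a\u00e4".encode("utf-8")[:-1]

# final=False corresponds to consumed != NULL: decode as much as
# possible and report how many bytes were used.
text, consumed = codecs.utf_8_decode(data, "strict", False)
leftover = data[consumed:]   # bytes to keep for the next call

# final=True corresponds to consumed == NULL: truncated input is an error.
try:
    codecs.utf_8_decode(data, "strict", True)
    truncated_raises = False
except UnicodeDecodeError:
    truncated_raises = True

# Invalid byte sequences raise in both modes.
try:
    codecs.utf_8_decode(b"a\xff", "strict", False)
    malformed_raises = False
except UnicodeDecodeError:
    malformed_raises = True
```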

Additionally for UTF-7 we need to pass the decoder 
state around.

An implementation of c) looks like this:

def read(self, size=-1):
    if size < 0:
        data = self.bytebuffer+self.stream.read()
    else:
        data = self.bytebuffer+self.stream.read(size)
    (object, decodedbytes) = self.decode(data, self.errors)
    self.bytebuffer = data[decodedbytes:]
    return object

Unfortunately this changes the semantics: read() might 
return an empty string even though more data is 
available, because the bytes read so far may not yet form 
a complete character. This can be fixed by continuing to 
read until at least one character is available.
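
A minimal sketch of such a loop, written for current
Python 3. BufferingReader is a hypothetical class for
illustration, not the code from the patch, and it assumes
UTF-8 via codecs.utf_8_decode() with final=False:

```python
import codecs
import io


class BufferingReader:
    """Hypothetical reader that keeps undecoded bytes between calls."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""

    def decode(self, data, errors):
        # final=False: stop quietly in front of an incomplete sequence.
        return codecs.utf_8_decode(data, errors, False)

    def read(self, size=-1):
        while True:
            if size < 0:
                newdata = self.stream.read()
            else:
                newdata = self.stream.read(size)
            data = self.bytebuffer + newdata
            text, decodedbytes = self.decode(data, self.errors)
            # Keep the undecoded tail for the next iteration or call.
            self.bytebuffer = data[decodedbytes:]
            # Return once a character is available, or once the stream
            # has nothing more to offer (an incomplete tail then simply
            # stays in self.bytebuffer).
            if text or not newdata:
                return text


reader = BufferingReader(io.BytesIO("a\u00e4\u20ac".encode("utf-8")))
# Reading one byte at a time still yields whole characters.
pieces = [reader.read(1) for _ in range(4)]
```

With this loop an empty string is returned only when the
underlying stream itself returns no more data.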

The patch implements a few additional features:
read() has an additional argument chars that can be 
used to specify the number of characters that should be 
returned.

readline() is supported on all readers derived from 
codecs.StreamReader.

readline() and readlines() have an additional option for 
dropping the trailing u"\n".
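
For illustration, this is how the corresponding features
look in current Python 3's codecs.StreamReader. The
argument name keepends is taken from today's standard
library and is an assumption here; this report does not
say what the option was called in the patch:

```python
import codecs
import io

raw = "a\u00e4b\u20ac\nsecond line\n".encode("utf-8")

# chars limits the number of decoded characters returned, not bytes.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
first_two = reader.read(chars=2)

# keepends=False drops the line terminator from the returned line.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
line_without_newline = reader.readline(keepends=False)

# readlines() accepts the same option.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
all_lines = reader.readlines(keepends=False)
```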

The patch is still missing changes to the escape codecs
("unicode_escape" and "raw_unicode_escape"), but it
has test cases that check the new functionality for all
affected codecs (UTF-7, UTF-8, UTF-16, UTF-16-LE,
UTF-16-BE).
